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Abstract 

Undirected graphs are often used to describe high dimensional distributions. Under sparsity 
conditions, the graph can be estimated using l\ -penalization methods. We propose and study the 
following method. We combine a multiple regression approach with ideas of thresholding and 
refitting: first we infer a sparse undirected graphical model structure via thresholding of each 
among many l\ -norm penalized regression functions; we then estimate the covariance matrix and 
its inverse using the maximum likelihood estimator. We show that under suitable conditions, this 
approach yields consistent estimation in terms of graphical structure and fast convergence rates 
with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We 
also derive an explicit bound for the Kullback Leibler divergence. 

Keywords: Graphical model selection, covariance estimation, Lasso, nodewise regression, 
thresholding 



1. Introduction 

There have been a lot of recent activities for estimation of high-dimensional covariance and inverse 
covariance matrices where the dimension p of the matrix may greatly exceed the sample size n. 
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High-dimensional covariance estimation can be classified into two main categories, one which re- 
lies on a natural ordering among the variables [Bickel and Levina, 2004, 2008; Furrer and Bengtsson, 
2007; Huang et al., 2006; Levina et al., 2008; Wu and Pourahmadi, 2003] and one where no nat- 
ural ordering is given and estimators are permutation invariant with respect to indexing the vari- 
ables [Banerjee et al, 2008; d'Aspremont et al., 2008; Friedman et al., 2007; Rothman et al., 2008; 
Yuan and Lin, 2007]. We focus here on the latter class with permutation invariant estimation and 
we aim for an estimator which is accurate for both the covariance matrix £ and its inverse, the 
precision matrix £ . A popular approach for obtaining a permutation invariant estimator which 
is sparse in the estimated precision matrix £ _1 is given by the ^i-norm regularized maximum- 
likelihood estimation, also known as the GLasso [Banerjee et al., 2008; Friedman et al., 2007; 
Yuan and Lin, 2007]. The GLasso approach is simple to use, at least when relying on publicly 
available software such as the glasso package in R. Further improvements have been reported 
when using some SCAD-type penalized maximum-likelihood estimator [Lam and Fan, 2009] or 
an adaptive GLasso procedure [Fanetal., 2009], which can be thought of as a two-stage pro- 
cedure. It is well-known from linear regression that such two- or multi-stage methods effec- 
tively address some bias problems which arise from l\ -penalization [Buhlmann and Meier, 2008; 
Candes and Tao, 2007; Meinshausen, 2007; Zhou, 2009, 2010b; Zou, 2006; Zou and Li, 2008]. 

In this paper we develop a new method for estimating graphical structure and parameters for 
multivariate Gaussian distributions using a multi-step procedure, which we call Gelato (Graph 
estimation with Lasso and Thresholding). Based on an £i-norm regularization and thresholding 
method in a first stage, we infer a sparse undirected graphical model, i.e. an estimated Gaussian 
conditional independence graph, and we then perform unpenalized maximum likelihood estima- 
tion (MLE) for the covariance £ and its inverse £ -1 based on the estimated graph. We make the 
following theoretical contributions: (i) Our method allows us to select a graphical structure which 
is sparse. In some sense we select only the important edges even though there may be many non- 
zero edges in the graph, (ii) Secondly, we evaluate the quality of the graph we have selected by 
showing consistency and establishing a fast rate of convergence with respect to the operator and 
Frobenius norm for the estimated inverse covariance matrix; under sparsity constraints, the latter 
is of lower order than the corresponding results for the GLasso [Rothman et al., 2008] and for the 
SCAD-type estimator [Lam and Fan, 2009]. (iii) We show predictive risk consistency and provide 
a rate of convergence of the estimated covariance matrix, (iv) Lastly, we show general results for 
the MLE, where only approximate graph structures are given as input. Besides these theoretical 
advantages, we found empirically that our graph based method performs better in general, and 
sometimes substantially better than the GLasso, while we never found it clearly worse. Moreover, 
we compare it with an adaptation of the method Space [Peng et al., 2009]. Finally, our algorithm is 
simple and is comparable to the GLasso both in terms of computational time and implementation 
complexity. 

There are a few key motivations and consequences for proposing such an approach based on 
graphical modeling. We will theoretically show that there are cases where our graph based 
method can accurately estimate conditional independencies among variables, i.e. the zeroes of 
£ , in situations where GLasso fails. The fact that GLasso easily fails to estimate the zeroes 
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of £ has been recognized by Meinshausen [2008] and it has been discussed in more details 
in Ravikumar et al. [2008]. Closer relations to existing work are primarily regarding our first 
stage of estimating the structure of the graph. We follow the nodewise regression approach 
from Meinshausen and Biihlmann [2006] but we make use of recent results for variable selection 
in linear models assuming the much weaker restricted eigenvalue condition [Bickel et al., 2009; 
Zhou, 2010b] instead of the restrictive neighborhood stability condition [Meinshausen and Biihlmann, 
2006] or the equivalent irrepresentable condition [Zhao and Yu, 2006]. In some sense, the novelty 
of our theory extending beyond Zhou [2010b] is the analysis for covariance and inverse covariance 
estimation and for risk consistency based on an estimated sparse graph as we mentioned above. 
Our regression and thresholding results build upon analysis of the thresholded Lasso estimator 
as studied in Zhou [2010b]. Throughout our analysis, the sample complexity is one of the key 
focus point, which builds upon results in Rudelson and Zhou [2011]; Zhou [2010a]. Once the 
zeros are found, a constrained maximum likelihood estimator of the covariance can be computed, 
which was shown in Chaudhuri et al. [2007]; it was unclear what the properties of such a pro- 
cedure would be. Our theory answers such questions. As a two-stage method, our approach is 
also related to the adaptive Lasso [Zou, 2006] which has been analyzed for high-dimensional sce- 
narios in Huang et al. [2008]; van de Geer et al. [2010]; Zhou et al. [2009]. Another relation can 
be made to the method by Riitimann and Biihlmann [2009] for covariance and inverse covariance 
estimation based on a directed acyclic graph. This relation has only methodological character: the 
techniques and algorithms used in Riitimann and Biihlmann [2009] are very different and from a 
practical point of view, their approach has much higher degree of complexity in terms of compu- 
tation and implementation, since estimation of an equivalence class of directed acyclic graphs is 
difficult and cumbersome. There has also been work that focuses on estimation of sparse directed 
Gaussian graphical model. Verzelen [2010] proposes a multiple regularized regression procedure 
for estimating a precision matrix with sparse Cholesky factors, which correspond to a sparse di- 
rected graph. He also computes non-asymptotic Kullback Leibler risk bound of his procedure for 
a class of regularization functions. It is important to note that directed graph estimation requires a 
fixed good ordering of the variables a priori. 

Notation. We use the following notation. Given a graph G = (V,Eq), where V = {1, . . . ,p} is 
the set of vertices and Eq is the set of undirected edges, we use s* to denote the degree for node i, 
that is, the number of edges in Eq connecting to node i. For an edge set E, we let \E\ denote its 
size. We use ©o — 1 an< ^ to refer to the true precision and covariance matrices respectively 
from now on. We denote the number of non-zero elements of by supp(0). For any matrix 
W = (wij), let \W\ denote the determinant of W, tx(W) the trace of W. Let (p max (W) and 
tPmmiW) be the largest and smallest eigenvalues, respectively. We write diag(VF) for a diagonal 
matrix with the same diagonal as W and offd(VF) = W — diag(W). The matrix Frobenius norm is 
given by || W|| F = \jY^i Z)j w ij- The operator norm \\W\\l is given by ip max (WW T ). We write 
| • |i for the l\ norm of a matrix vectorized, i.e., for a matrix |W|i = HvecWl^ = Yli Ylj \ w ij\> 
and sometimes write ||W|| for the number of non-zero entries in the matrix. For an index set T 
and a matrix W = [w^], write Wt = £ T)), where /(•) is the indicator function. 
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2. The model and the method 

We assume a multivariate Gaussian model 

X = (X 1 ,...,X p )~M p (p,V ), where E ,« = 1. (1) 

The data is generated by X^\ . . . , X^ i.i.d. ~ A/" p (0, So). Requiring the mean vector and all 
variances being equal to zero and one respectively is not a real restriction and in practice, we can 
easily center and scale the data. We denote the concentration matrix by 0o = Sq 1 . 

Since we will use a nodewise regression procedure, as described below in Section 2.1, we consider 
a regression formulation of the model. Consider many regressions, where we regress one variable 
against all others: 

Xi = J2Pj X 3+Vi(i = l,...,p), where (2) 

Vi ~ W(0, o\ ) independent of {X,-; j ^ i} (i = 1, . . . , p) . (3) 

There are explicit relations between the regression coefficients, error variances and the concentra- 
tion matrix Go = (@o,ij) : 

% = -0 Q #/e 0iii , Var(^) := a\ = l/9 ,u = 1, . . . ,p). (4) 

Furthermore, it is well known that for Gaussian distributions, conditional independence is encoded 
in 0o, and due to (4), also in the regression coefficients: 

Xi is conditionally dependent of Xj given {X^; k € {1, . . . ,p} \ {i, j}} 

^ e ,ij ^ <=> Pjyt and P) ^ 0. (5) 

For the second equivalence, we assume that Var(Vj) = l/9o,u > and Var(Vj) = > 0. 

Conditional (in-)dependencies can be conveniently encoded by an undirected graph, the condi- 
tional independence graph which we denote by G = (V,Eq). The set of vertices is V = {1 , . . . , p} 
and the set of undirected edges Eq C V x V is defined as follows: 

there is an undirected edge between nodes i and j 

^ 9o,ij 7^0 ^>0and/3j^0. (6) 

Note that on the right hand side of the second equivalence, we could replace the word "and" by 
"or". For the second equivalence, we assume Var(Vi), Var(V}) > following the remark after (5). 

We now define the sparsity of the concentration matrix 0o or the conditional independence graph. 
The definition is different than simply counting the non-zero elements of 0o, for which we have 
supp (Go) = p + 2|£?o|- We consider instead the number of elements which are sufficiently large. 
For each i, define the number s n as the smallest integer such that the following holds: 

v 

£ min{^.,A 2 ^} < Son A 2 o ,u, where A = y / 2log(p)/n, (7) 
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where essential sparsity s l n at row i describes the number of "sufficiently large" non-diagonal 
elements Oq^ relative to a given (n,p) pair and 9o t a,i = 1, . . . ,p. The value So,™ hi (8) is 
summing essential sparsity across all rows of @ , 

v 

So,n ■= ^2 s 0,n- ( 8 ) 

i=l 

Due to the expression of A, the value of So,n depends on p and n. For example, if all non- 
zero non-diagonal elements #o,ij of the ith row are larger in absolute value than X^Oo.u, the 
value Sg n coincides with the node degree s\ However, if some (many) of the elements |#o,r/| 
are non-zero but small, s n is (much) smaller than its node degree s l ; As a consequence, if some 
(many) of |#o,y | ; Vi, j, i ^ j are non-zero but small, the value of So,n is a l so (much) smaller than 
2\Eq\, which is the "classical" sparsity for the matrix (6o — diag(Oo)). See Section A for more 
discussions. 



2.1 The estimation procedure 

The estimation of ©o and So = ©q 1 is pursued in two stages. We first estimate the undirected 
graph with edge set Eq as in (6) and we then use the maximum likelihood estimator based on 
the estimate E n , that is, the non-zero elements of n correspond to the estimated edges in E n . 
Inferring the edge set Eq can be based on the following approach as proposed and theoretically 
justified in Meinshausen and Biihlmann [2006]: perform p regressions using the Lasso to obtain 
p vectors of regression coefficients /3 1 , . . . , (3 P where for each i, p l = {/?*•; j G {1, . . . ,p} \ i}; 
Then estimate the edge set by the "OR" rule, 

estimate an edge between nodes i and j <J=^ /3* ^ or (3f ^ 0. (9) 

Nodewise regressions for inferring the graph. In the present work, we use the Lasso in 
combination with thresholding [Zhou, 2009, 2010b]. Consider the Lasso for each of the nodewise 
regressions 

n 

0U = argmhy ^(xf - £ f^xff + A n ^ l$l for i = 1, . . . ,p, (10) 

r=l j^i j^i 

where A n > is the same regularization parameter for all regressions. Since the Lasso typically 
estimates too many components with non-zero estimated regression coefficients, we use thresh- 
olding to get rid of variables with small regression coefficients from solutions of (10): 

^•(A n) r) = 4 init (A re )I(|/3j iMt (A n )| > r), (11) 

where r > is a thresholding parameter. We obtain the corresponding estimated edge set as 
defined by (9) using the estimator in (1 1) and we use the notation 

E n (X n ,r). (12) 
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We note that the estimator depends on two tuning parameters A n and r. 

The use of thresholding has clear benefits from a theoretical point of view: the number of false 
positive selections may be much larger without thresholding (when tuned for good prediction), 
and a similar statement would hold when comparing the adaptive Lasso with the standard Lasso. 
We refer the interested reader to Zhou [2009, 2010b] and van de Geer et al. [2010]. 

Maximum likelihood estimation based on graphs. Given a conditional independence graph 
with edge set E, we estimate the concentration matrix by maximum likelihood. Denote by S n = 
n~ l X]"=i X( r \X^) T the sample covariance matrix (using that the mean vector is zero) and by 

f n = diag(5 n )- 1 / 2 (5 n )diag(5 n )- 1 / 2 (13) 

the sample correlation matrix. The estimator for the concentration matrix in view of (1) is: 

@ n (E) = argmin eg _ Mp E (tr(®f n ) - log |9|) , where 

M p ,e = {6e K pxp ; 9^0 and % = for all E, where i ^ j} (14) 

defines the constrained set for positive definite 9. If n > q* where q* is the maximal clique 
size of a minimal chordal cover of the graph with edge set E, the MLE exists and is unique, 
see, for example Uhler [2011, Corollary 2.3]. We note that our theory guarantees that n > q* 
holds with high probability for G = (V,E), where E = E n {X n ,r)), under Assumption (Al) to 
be introduced in the next section. The definition in (14) is slightly different from the more usual 
estimator which uses the sample covariance S n rather than T n . Here, the sample correlation matrix 
reflects the fact that we typically work with standardized data where the variables have empirical 
variances equal to one. The estimator in (14) is constrained leading to zero-values corresponding 
to E c = {(i,j) = l,...,p,i^j, g E}. 

If the edge set E is sparse having relatively few edges only, the estimator in (14) is already suf- 
ficiently regularized by the constraints and hence, no additional penalization is used at this stage. 
Our final estimator for the concentration matrix is the combination of (12) and (14): 

9 n = 9 n ( J g n (A n ,r)). (15) 

Choosing the regularization parameters. We propose to select the parameter A„ via cross- 
validation to minimize the squared test set error among all p regressions: 

^ p 

A n = argmin A ^ (CV-score(A) of ith regression) , 

i=l 

where CV-score(A) of ith regression is with respect to the squared error prediction loss. Sequen- 
tially proceeding, we then select r by cross-validating the multivariate Gaussian log-likelihood, 
from (14). Regarding the type of cross-validation, we usually use the 10-fold scheme. Due to the 
sequential nature of choosing the regularization parameters, the number of candidate estimators is 
given by the number of candidate values for A plus the number of candidate value for r. In Sec- 
tion 4, we describe the grids of candidate values in more details. We note that for our theoretical 
results, we do not analyze the implications of our method using estimated A n and r. 
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3. Theoretical results 

In this section, we present in Theorem 1 convergence rates for estimating the precision and the co- 
variance matrices with respect to the Frobenius norm; in addition, we show a risk consistency re- 
sult for an oracle risk to be defined in (17). Moreover, in Proposition 4, we show that the model we 
select is sufficiently sparse while at the same time, the bias term we introduce via sparse approxi- 
mation is sufficiently bounded. These results illustrate the classical bias and variance tradeoff. Our 
analysis is non-asymptotic in nature; however, we first formulate our results from an asymptotic 
point of view for simplicity. To do so, we consider a triangular array of data generating random 
variables 

X^\...,X^ n) i.i.d. ~ A/" p (0,£ ), n = 1,2,... (16) 

where So = So, n and p = p n change with n. Let Go '■= Sq . We make the following assump- 
tions. 

(AO) The size of the neighborhood for each node i G V is upper bounded by an integer s < p 
and the sample size satisfies for some constant C 

n > Cs\og(p/s). 

(Al) The dimension and number of sufficiently strong non-zero edges So,n as i n (8) satisfy: di- 
mension p grows with n following p = o(e cn ) for some constant < c < 1 and 

So,n = o(n/ log max(n,p)) (n — > oo). 

(A2) The minimal and maximal eigenvalues of the true covariance matrix So are bounded: for 
some constants M upp > M\ ow > 0, we have 

(So) > M\ ov > and y? m ax(So) < M upp < oo. 

Moreover, throughout our analysis, we assume the following. There exists v 2 > such that 
for all i, and Vi as defined in (3): Var(Vi) = l/#o,ii > v 2 ■ 

Before we proceed, we need some definitions. Define for B >- 

i?(e)=tr(GS )-log|G|, (17) 

where minimizing (17) without constraints gives Go- Given (8), (7), and Go, define 

Quag := min{ max 6^, max (s l /S , n ) ■ ||diag(G )||^}. (18) 
i=i, ...p i=\,...,p 

We now state the main results of this paper. We defer the specification on various tuning parame- 
ters, namely, A n , r to Section 3.2, where we also provide an outline for Theorem 1. 
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Theorem 1 Consider data generating random variables as in (16) and assume that (AO), (Al), 
and (A2) hold. We assume Eo,m = ^f or a ^ Then, with probability at least 1 — d/p 2 ,for some 
small constant d > 2, we obtain under appropriately chosen \ n and t, an edge set E n as in (12), 
such that 

\E n \< 4S ,n, where \E n \E \< 2S , n ; (19) 
and for Q n and E n = (0 n ) _1 as defined in (15) the following holds, 



e n -e 

S n — Eq 



< ||e„-e ||F 

2 



Op ( yj S . n log max(n, p)/n\ , 



2 <||E n -S ||F = Op I yS , n log max(n,p)/n ) , 
R(@ n )-R(@o) = Op (5 , n log max(n,p)/n) 



where the constants hidden in the Op() notation depend on r, Mi ow , M upp , Cdiag a* *w (18), an J 
constants concerning sparse and restrictive eigenvalues ofT,Q (cf. Section 3.2 and B). 

We note that convergence rates for the estimated covariance matrix and for predictive risk depend 
on the rate in Frobenius norm of the estimated inverse covariance matrix. The predictive risk can 
be interpreted as follows. Let X ~ A/"(0, So) with fy denoting its density. Let /g be the density 
for jV(0, £ n ) and L>KL(So||£ n ) denotes the Kullback Leibler (KL) divergence from jV(0, So) to 
7V(0, E n ). Now, we have for E, E n >- 0, 

R(O n ) - R(@ ) := 2E [log/ Eo (X) - log/g n (X)] := 2D KL (Z \\Z n ) > 0. (20) 

Actual conditions and non-asymptotic results that are involved in the Gelato estimation appear in 
Sections B, C, and D respectively. 

Remark 2 Implicitly in (Al ), we have specified a lower bound on the sample size to be n = 
(<Sb,n logmax(n,p)). For the interesting case of p > n, a sample size of 

n = n (max(S , o i „logp, slog(p/s))) (21) 

is sufficient in order to achieve the rates in Theorem 1. As to be shown in our analysis, the lower 
bound on n is slightly different for each Frobenius norm bound to hold from a non- asymptotic 
point of view (cf. Theorem 19 and 20). 

Theorem 1 can be interpreted as follows. First, the cardinality of the estimated edge set exceeds 
So,n at m ost by a factor 4, where 5o, n as in (8) is the number of sufficiently strong edges in the 
model, while the number of false positives is bounded by 2So,n- Note that the factors 4 and 2 can 
be replaced by some other constants, while achieving the same bounds on the rates of convergence 
(cf. Section D.l). We emphasize that we achieve these two goals by sparse model selection, where 
only important edges are selected even though there are many more non-zero edges in Eq, under 
conditions that are much weaker than (A2). More precisely, (A2) can be replaced by conditions 
on sparse and restrictive eigenvalues (RE) of Eq. Moreover, the bounded neighborhood constraint 
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(AO) is required only for regression analysis (cf. Theorem 15) and for bounding the bias due to 
sparse approximation as in Proposition 4. This is shown in Sections B and C. Analysis follows 
from Zhou [2009, 2010b] with earlier references to Bickel et al. [2009]; Candes and Tao [2007]; 
Meinshausen and Yu [2009] for estimating sparse regression coefficients. 

We note that the conditions that we use are indeed similar to those in Rothman et al. [2008], with 
(Al) being much more relaxed when 5o,n "C \Eq\. The convergence rate with respect to the 
Frobenius norm should be compared to the rate Op(y/\Eo\ logmax(n,p)/n) in case diag(Eo) is 
known, which is the rate in Rothman et al. [2008] for the GLasso and for SCAD [Lam and Fan, 
2009]. In the scenario where |Eb| 3> 5b,n> i- e - there are many weak edges, the rate in Theorem 
1 is better than the one established for GLasso [Rothman et al., 2008] or for the SCAD-type esti- 
mator [Lam and Fan, 2009]; hence we require a smaller sample size in order to yield an accurate 
estimate of Bo- 

Remark 3 For the general case where £o,ii, * = 1; • • • > P are not assumed to be known, we could 
achieve essentially the same rate as stated in Theorem 1 for ||0 n — ©q || 2 an d ||5j n — So II 2 
under (Aq), {A{) and (A2) following analysis in the present work (cf. Theorem 6) and that 
in Rothman et al. [2008, Theorem 2 J. Presenting full details for such results are beyond the scope 
of the current paper. We do provide the key technical lemma which is essential for showing such 
bounds based on estimating the inverse of the correlation matrix in Theorem 6; see also Remark 7 
which immediately follows. 

In this case, for the Frobenius norm and the risk to converge to zero, a too large value of p is not 
allowed. Indeed, for the Frobenius norm and the risk to converge, (Al) is to be replaced by: 

(A3) p X n c for some constant < c < 1 and p + <So,n = °( n l 1°§ max(ra,p)) as n — > 00. 

In this case, we have 



©n-©o||F = Op [ J(p + So >n )logmax(n,p)/n 



||E n -E ||jr = Op I y(p + So, n ) log max(n,p)/n 

R(@ n )-R(& ) = Op ((p + S ,n) log max(n,p)/n). 

Moreover, in the refitting stage, we could achieve these rates with the maximum likelihood estima- 
tor based on the sample covariance matrix S n as defined in (22): 



® n (E) = argmin eeMpE (tr(6S„) - log |0|J , where 

M PtE = {8e M pxp ; >- and 0% = for all (i,j) E, where i ^ j} (22) 

A real high-dimensional scenario where p ^> n is excluded in order to achieve Frobenius norm 
consistency. This restriction comes from the nature of the Frobenius norm and when considering 
e.g. the operator norm, such restrictions can indeed be relaxed as stated above. 
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It is also of interest to understand the bias of the estimator caused by using the estimated edge set 
E n instead of the true edge set Eq. This is the content of Proposition 4. For a given E n , denote by 

9 = diag(9 ) + (®o)e„ = diag(9 ) + ©o^niV 

where the second equality holds since &o,E% = 0. Note that the quantity Bo is identical to Go on 
E n and on the diagonal, and it equals zero on E% = : i, j = 1, . . . ,p, i ^ j, E n }. 

Hence, the quantity @o,v '■= ©o — @o measures the bias caused by a potentially wrong edge set 
E n ; note that Go = 6o if E n = E . 

Proposition 4 Consider data generating random variables as in expression (16). Assume that 
(AO), (Al ), and (A2) hold. Then we have for choices on X n , r as in Theorem 1 and E n in (12), 

||0o,d|| f := ||@0 - ©oll-F = Op \J S , n log max(n, p)/n^j . 

We note that we achieve essentially the same rate for || (Go)" 1 — So I If; see Remark 27. We give 
an account on how results in Proposition 4 are obtained in Section 3.2, with its non-asymptotic 
statement appearing in Corollary 17. 



3.1 Discussions and connections to previous work 

It is worth mentioning that consistency in terms of operator and Frobenius norms does not depend 
too strongly on the property to recover the true underlying edge set Eq in the refitting stage. 
Regarding the latter, suppose we obtain with high probability the screening property 

E C E, (23) 

when assuming that all non-zero regression coefficients |/3j| are sufficiently large (E might be an 
estimate and hence random). Although we do not intend to make precise the exact conditions 
and choices of tuning parameters in regression and thresholding in order to achieve (23), we state 
Theorem 5, in case (23) holds with the following condition: the number of false positives is 
bounded as \E \ E Q \ = 0(5). 

Theorem 5 Consider data generating random variables as in expression (16) and assume that 
(Al) and (A2) hold, where we replace 5o, n with S := \Eq\ = Yli=i s *- We assume £o,n = l/ or 
all i. Suppose on some event £, such that P (£) > 1 — d/p 2 for a small constant d, we obtain 
an edge set E such that Eq C E and \E \ Eq\ = O(S). Let Q n (E) be the minimizer as defined 
in (14). Then, we have \\Q n (E) — &o\\f = Op (^\/ S log max(n, ~p)jn\ . 

It is clear that this bound corresponds to exactly that of Rothman et al. [2008] for the GLasso 
estimation under appropriate choice of the penalty parameter for a general S >- with Ejj = 1 
for all i (cf. Remark 3). We omit the proof as it is more or less a modified version of Theorem 19, 
which proves the stronger bounds as stated in Theorem 1 . We note that the maximum node-degree 
bound in (AO) is not needed for Theorem 5. 
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We now make some connections to previous work. First, we note that to obtain with high prob- 
ability the exact edge recovery, E = Eq, we need again sufficiently large non-zero edge weights 
and some restricted eigenvalue (RE) conditions on the covariance matrix as defined in Section A 
even for the multi-stage procedure. An earlier example is shown in Zhou et al. [2009], where 
the second stage estimator (3 corresponding to (11) is obtained with nodewise regressions using 
adaptive Lasso [Zou, 2006] rather than thresholding as in the present work in order to recover 
the edge set Eq with high probability under an assumption which is stronger than (AO). Clearly, 
given an accurate E n , under (Al) and (A2) one can then apply Theorem 5 to accurately estimate 
G n . On the other hand, it is known that GLasso necessarily needs more restrictive conditions on 
So than the nodewise regression approach with the Lasso, as discussed in Meinshausen [2008] 
and Ravikumar et al. [2008] in order to achieve exact edge recovery. 

Furthermore, we believe it is straightforward to show that Gelato works under the RE conditions 
on S and with a smaller sample size than the analogue without the thresholding operation in 
order to achieve nearly exact recovery of the support in the sense that Eq E n and max^ \ E n ^ \ 
Eq : {\ is small, that is, the number of extra estimated edges at each node i is bounded by a small 
constant. This is shown essentially in Zhou [2009, Theoreml.l] for a single regression. Given such 
properties of E n , we can again apply Theorem 5 to obtain n under (Al) and (A2). Therefore, 
Gelato requires relatively weak assumptions on So in order to achieve the best sparsity and bias 
tradeoff as illustrated in Theorem 1 and Proposition 4 when many signals are weak, and Theorem 5 
when all signals in Eq are strong. 

3.2 An outline for Theorem 1 

Let so = maxj = i t p Sq n . We note that although sparse eigenvalues /3max(s)> Pmax (3so) and 
restricted eigenvalue for So (cf. Section A) are parameters that are unknown, we only need them 
to appear in the lower bounds for do, D4, and hence also that for A n and tQ that appear below. 
We simplify our notation in this section to keep it consistent with our theoretical non-asymptotic 
analysis to appeal - toward the end of this paper. 

Regression. We choose for some cq > 4\/2, < 9 < 1, and A = y / log(p)/n, 

X n = d X, where d > c (l + 6) 2 y/p max {s)p max (3s ). 

Let /3? nit , i = 1, ... ,p be the optimal solutions to (10) with A n as chosen above. We first prove an 
oracle result on nodewise regressions in Theorem 15. 

Thresholding. We choose for some constants D\, D4 to be defined in Theorem 15, 

*o — /o A := D^cIqX where D4 > D\ 

where D\ depends on restrictive eigenvalue of So; Apply (1 1) with r = tQ and /3? nit , i = 1, . . . ,p 
for thresholding our initial regression coefficients. Let 

!./ : 3 / < t = /oA}, 
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where bounds on i = 1, . . . ,p are given in Lemma 16. In view of (9), we let 

V = {(i,j) : eV'nVi}. (24) 

Selecting edge set E. Recall for a pair we take the OR rule as in (9) to decide if it is to be 
included in the edge set E: for V as defined in (24), define 

E :={(%, j) :i,j = l,...,p,i^j,(i,j) $V}. (25) 

to be the subset of pairs of non-identical vertices of G which do not appear in V; Let 

O = diag(Go) + Oo,e oe (26) 

for E as in (25), which is identical to ©o on all diagonal entries and entries indexed by E$ n E, 
with the rest being set to zero. As shown in the proof of Corollary 17, by thresholding, we have 
identified a sparse subset of edges E of size at most 45o, n , such that the corresponding bias 
1 1 ©0,2? 1 1 jr := ll®o — ©o||f is relatively small, i.e., as bounded in Proposition 4. 

Refitting. In view of Proposition 4, we aim to recover Qq given a sparse subset E; toward this 
goal, we use (14) to obtain the final estimator n and S n = (0 n ) -1 . We give a more detailed 
account of this procedure in Section D, with a focus on elaborating the bias and variance tradeoff. 
We show the rate of convergence in Frobenius norm for the estimated n and E n in Theorem 6, 
19 and 20, and the bound for Kullback Leibler divergence in Theorem 21 respectively. 



3.3 Discussion on covariance estimation based on maximum likelihood 

The maximum likelihood estimate minimizes over all 0^0, 

R n (G) =tr(05„)-log|0| (27) 

where S n is the sample covariance matrix. Minimizing R n (Q) without constraints gives S n = S n . 
We now would like to minimize (27) under the constraints that some pre-defined subset V of edges 
are set to zero. Then the follow relationships hold regarding @ n (E) defined in (22) and its inverse 
S n , and S n : for E as defined in (25), 

@n,ij = 0, V(*,j) € V and 

%n,ij = Sn,ij, V(z,j) G EU{(i,i),i = l,...,p}. 

Hence the entries in the covariance matrix S n for the chosen set of edges in E and the diagonal 
entries are set to their corresponding values in S n . Indeed, we can derive the above relationships 
via the Lagrange form, where we add Lagrange constants 7^ for edges in V, 

£ c (&) =log|0| -tr(5 n 0)- Tjfcfyfc- ( 2§ ) 

(j,k)ev 
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Now the gradient equation of (28) is: 

e- 1 - s n - r = o, 

where F is a matrix of Lagrange parameters such that 7^ / for all (j, k) G V and 7^ = 
otherwise. 

Similarly, the follow relationships hold regarding Q n (E) defined in (14) in case S a = 1 for all 
i, where S n is replaced with T n , and its inverse £„, and T n : for E as defined in (25), 

&n,ij = 0, V(i,j) G P and 

£n,i? = r nj jj = S n ,ij/aidj, V(i,j) G £ and 

^n,ii = 1, Vt = 1, . . . ,p. 

Finally, we state Theorem 6, which yields a general bound on estimating the inverse of the corre- 
lation matrix, when £0,11, • • • > ^>o,pp take arbitrary unknown values in R + = (0, 00). The corre- 
sponding estimator is based on estimating the inverse of the correlation matrix, which we denote 
by S7o- We use the following notations. Let = (po,ij) be the true correlation matrix and let 

i7o = ^0 1 - L et W = diag(Eo) 1 / 2 . Let us denote the diagonal entries of W with a\, . . . ,a p 

1/2 

where Oi := S u for all i. Then the following holds: 

S = WV W and6 = W _1 JW _1 

Given sample covariance matrix S n , we construct sample correlation matrix T n as follows. Let 

W = diag(5 n ) 1 / 2 and 

f n = W~HS n )W-\ where = = ^ (29) 

(JiUj ||Aj|| 2 ||Aj|| 2 

where af := S n> u. Thus T n is a matrix with diagonal entries being all Is and non-diagonal entries 
being the sample correlation coefficients, which we denote by pij. 

The maximum likelihood estimate for f2 = ^0 minimizes over all ft y 0, 

fl B (J2)=tr(nf n )-log|n| (30) 

To facilitate technical discussions, we need to introduce some more notation. Let S+ + denote the 
set of p x p symmetric positive definite matrices: 

Sl + = {9 G R pxp \Q y 0}. 

Let us define a subspace S P E corresponding to an edge set E C {(i, j) : i, j = 1, . . . ,p, i / j}: 

S P E := {9 G K pxp I % = V i ^ j s.t. (i, j) G* and denote 5 n = Sl + n 5|. (31) 

Minimizing R n (Q) without constraints gives ^ n = T n . Subject to the constraints that Q G S n as 
defined in (31), we write the maximum likelihood estimate for Qo : 

Q n (E) := arg min R n (fl) = arg min {tr(Or n ) — log } (32) 
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which yields the following relationships regarding £l n (E), its inverse ^ n = (Cl n (E)) 1 , and T n . 
For E as defined in (25), 

fi n ,ij = 0, V(i, j) e V 

and = 1 Vi = l, 

Given Q n (E) and its inverse \l/ n = (fi n (.E)) _1 , we obtain 

S n = |?$ n |? and 6 n = W^fW -1 
and therefore the following holds: for E as defined in (25), 

®n,ij = 0, V(i,j) £P 

and $ nji i = df = S n ,u Vi = l,...,p. 
The proof of Theorem 6 appears in Section E. 

Theorem 6 Consider data generating random variables as in expression (16) and assume that 
(Al) and (A2) hold. Let := maxj £o,ii < oo arcc? o"^ in := min^ £o,jj > 0. Lef £ be some 
event such that P (£) > 1 — d/p 2 for a small constant d. Let Sq !U be as defined in (8). Suppose on 
event £: 

1. We obtain an edge set E such that its size \E\ = lin (So,n) 

is a linear function in Sq n . 

2. And for Qq as in (26) and for some constant Ct,i as to be specified in (7 1 ), we have 



\®o,v\\ F 



G -e 



< C biasA /25o, n log(p)/n. (33) 



Let Ct n (E) be as defined in (32) Suppose the sample size satisfies for C3 > 4y 5/3, 

n > 144< ax / + ^MnpA max { 2 |£|logmax(n,p), C b 2 ias 25 ,„ logp} . (34) 

iW low V 11(r mm J 

Then with probability > l-(d+l)/p 2 , we have for M = (9a^ ax /(2fc 2 ))- (4C 3 + 13M U pp/(12(T^ in )) 
h n (E) -n F <(M + l)max(y/2\E\ log max(n, p)/n, C b ias^/2S* ,nlog(p)/n| . (35) 

Remark 7 We note that the constants in Theorem 6 are by no means the best possible. From (35), 
we can derive bounds on \\Q n (E) — ©o [I2 an d \\Yi n (E) — So||2 to be in the same order as in (35) 
following the analysis in Rothman et al. [2008, Theorem 2 ]. The corresponding bounds on the 

Frobenius norms on covariance estimation would be in the order of Op ^^/ P ^f° "^ as stated in 

Remark 3. 
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4. Numerical results 

We consider the empirical performance for simulated and real data. We compare our estimation 
method with the GLasso, the Space method and a simplified Gelato estimator without thresholding 
for inferring the conditional independence graph. The comparison with the latter should yield 
some evidence about the role of thresholding in Gelato. The GLasso is defined as: 



where T n is the empirical correlation matrix and the minimization is over positive definite ma- 
trices. Sparse partial correlation estimation (Space) is an approach for selecting non-zero partial 
correlations in the high-dimensional framework. The method assumes an overall sparsity of the 
partial correlation matrix and employs sparse regression techniques for model fitting. For details 
see Peng et al. [2009]. We use Space with weights all equal to one, which refers to the method type 
space in Peng et al. [2009]. For the Space method, estimation of ©o is done via maximum likeli- 
hood as in (14) based on the edge set En Pa f rom the estimated sparse partial correlation matrix. 
For computation of the three different methods, we used the R-packages glmnet [Friedman et al., 
2010], glasso [Friedman et al., 2007] and space [Peng et al., 2009]. 

4.1 Simulation study 

In our simulation study, we look at three different models. 

• An AR(l)-Block model. In this model the covariance matrix is block-diagonal with equal- 
sized AR(l)-blocks of the form S B/ocfc = {0.9^}^. 

• The random concentration matrix model considered in Rothman et al. [2008]. In this model, 
the concentration matrix is = B + 51 where each off-diagonal entry in B is generated 
independently and equal to or 0.5 with probability 1 — tt or n, respectively. All diagonal 
entries of B are zero, and 5 is chosen such that the condition number of is p. 

• The exponential decay model considered in Fan et al. [2009]. In this model we consider a 
case where no element of the concentration matrix is exactly zero. The elements of ©o are 
given by Oo^j = exp(— 2\i — j\) equals essentially zero when the difference \i — j\ is large. 

We compare the three estimators for each model with p = 300 and n = 40, 80, 320. For each 
model we sample data . . . i.i.d. ~ A/"(0, So). We use two different performance 

measures. The Frobenius norm of the estimation error ||E n — So||f and ||0 n — @o||f> and the 
Kullback Leibler divergence between A/"(0, Eo) and AA(0, S n ) as defined in (20). 

For the three estimation methods we have various tuning parameters, namely A, r (for Gelato), p 
(for GLasso) and r\ (for Space). We denote the regularization parameter of the Space technique 
by r/ in contrary to Peng et al. [2009], in order to distinguish it from the other parameters. Due to 
the computational complexity we specify the two parameters of our Gelato method sequentially. 
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That is, we derive the optimal value of the penalty parameter A by 10-fold cross-validation with 
respect to the test set squared error for all the nodewise regressions. After fixing A = Xqv we 
obtain the optimal threshold r again by 10-fold cross-validation but with respect to the negative 
Gaussian log-likelihood (tr(QS out ) — log |0|, where 5°"* is the empirical covariance of the hold- 
out data). We could use individual tuning parameters for each of the regressions. However, this 
turned out to be sub-optimal in some simulation scenarios (and never really better than using a 
single tuning parameter A, at least in the scenarios we considered). For the penalty parameter p 
of the GLasso estimator and the parameter rj of the Space method we also use a 10-fold cross- 
validation with respect to the negative Gaussian log-likelihood. The grids of candidate values are 
given as follows: 



fogp k = 11Q wkh Tk = Q75 . B 
n V n 



Pk = C k \l^- fc = l, ...,10 
V n 

ri r = 1.56Vn$ -1 ( 1- ft 



2p 

where A k , B k , C k G {0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 4, 8, 16} and D r G {0.01, 0.05, 0.075, 0.1, 0.2, 
0.5, 1}. The two different performance measures are evaluated for the estimators based on the 
sample . . . , X^> with optimal CV-estimated tuning parameters A, r, p and r/ for each model 
from above. All results are based on 50 independent simulation runs. 



4.1.1 The AR(l)-BLOCK MODEL 

We consider two different covariance matrices. The first one is a simple auto-regressive process 
of order one with trivial block size equal to p = 300, denoted by Sq 1 ^ This is also known as a 
Toeplitz matrix. That is, we have Sq 1 ]^ = .91* I V i, j G {1, ...,£>}■ The second matrix Sq 2 ^ 
is a block-diagonal matrix with AR(1) blocks of equal block size 30 x 30, and hence the block- 
diagonal of £q equals ^Biock;i,j = 0.9'* - - 7 ', i, j G {1, . . . ,30}. The simulation results for the 
AR(l)-block models are shown in Figure 1 and 2. 

The figures show a substantial performance gain of our method compared to the GLasso in both 
considered covariance models. This result speaks for our method, especially because AR(1)- 
block models are very simple. The Space method performs about as well as Gelato, except for 
the Frobenius norm of S n — So- There we see an performance advantage of the Space method 
compared to Gelato. We also exploit the clear advantage of thresholding in Gelato for a small 
sample size. 

4. 1.2 The random precision matrix model 

For this model we also consider two different matrices, which differ in sparsity. For the sparser 
matrix Oq 3 ^ we set the probability it to 0.1. That is, we have an off diagonal entry in 0( 3 ) of 0.5 



16 



High-dimensional Covariance Estimation 




0.2 0.4 0.6 0.8 1.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.05 0.10 0.15 0.20 0.25 0.30 0.35 

Lambda/Rho Lambda/Rho Lambda/Rho 



(g) n = 40 (h) n = 80 (i) n = 320 

Figure 1 : Plots for model Eg 1 ' 1 . The triangles (green) stand for the GLasso and the circles (red) for 
our Gelato method with a reasonable value of r. The horizontal lines show the perfor- 
mances of the three techniques for cross-validated tuning parameters A, r, p and rj. The 
dashed line stands for our Gelato method, the dotted one for the GLasso and the dash- 
dotted line for the Space technique. The additional dashed line with the longer dashes 
stands for the Gelato without thresholding. Lambda/Rho stands for A or p, respectively. 
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Figure 2: Plots for model Sq . The triangles (green) stand for the GLasso and the circles (red) for 
our Gelato method with a reasonable value of r. The horizontal lines show the perfor- 
mances of the three techniques for cross-validated tuning parameters A, r, p and 77. The 
dashed line stands for our Gelato method, the dotted one for the GLasso and the dash- 
dotted line for the Space technique. The additional dashed line with the longer dashes 
stands for the Gelato without thresholding. Lambda/Rho stands for A or p, respectively. 
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1 2 3 4 5 1 2 3 0.0 0.5 1.0 1.5 

Lambda/Rho Lambda/Rho Lambda/Rho 



(g) n = 40 (h) n = 80 (i) n = 320 

(3) 

Figure 3: Plots for model 0q . The triangles (green) stand for the GLasso and the circles (red) for 
our Gelato method with a reasonable value of r. The horizontal lines show the perfor- 
mances of the three techniques for cross-validated tuning parameters A, r, p and r\. The 
dashed line stands for our Gelato method, the dotted one for the GLasso and the dash- 
dotted line for the Space technique. The additional dashed line with the longer dashes 
stands for the Gelato without thresholding. Lambda/Rho stands for A or p, respectively. 
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(a) n = 40 
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(d) n = 40 
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(g) n = 40 



E 
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(b) n = 80 



(e) n = 80 



Lambda/Rho 

(h) n = 80 



\/ 



0.5 1.0 

Lambda/Rho 

(c) n = 320 



0.5 1.0 

Lambda/Rho 

(f) n = 320 



1.0 1.5 

Lambda/Rho 



(i) ra = 320 



Figure 4: Plots for model 0q . The triangles (green) stand for the GLasso and the circles (red) for 
our Gelato method with a reasonable value of r. The horizontal lines show the perfor- 
mances of the three techniques for cross-validated tuning parameters A, r, p and rj. The 
dashed line stands for our Gelato method, the dotted one for the GLasso and the dash- 
dotted line for the Space technique. The additional dashed line with the longer dashes 
stands for the Gelato without thresholding. Lambda/Rho stands for A or p, respectively. 
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with probability ir = 0.1 and an entry of with probability 0.9. In the case of the second matrix 
0q 4 ^ we set 7r to 0.5 which provides us with a denser concentration matrix. The simulation results 
for the two performance measures are given in Figure 3 and 4. 

From Figures 3 and 4 we see that GLasso performs better than Gelato with respect to ||0 n — O \\f 
and the Kullback Leibler divergence in both the sparse and the dense simulation setting. If we 
consider ||S n — So I If, Gelato seems to keep up with GLasso to some degree. For the Space 
method we have a similar situation to the one with GLasso. The Space method outperforms Gelato 
for ||0 n — ©o||f and Z)KL(So||S n ) but for ||S n — So||f> Gelato somewhat keeps up with Space. 

4.1.3 THE EXPONENTIAL DECAY MODEL 

(5) 

In this simulation setting we only have one version of the concentration matrix O . The entries 

(5*) (5) 

of 0q are generated by 6q [■ = exp(— 2 1 z — Thus, So is a banded and sparse matrix. 

Figure 5 shows the results of the simulation. We find that all three methods show equal per- 
formances in both the Frobenius norm and the Kullback Leibler divergence. This is interesting 
because even with a sparse approximation of ©o (with GLasso or Gelato), we obtain competitive 
performance for (inverse) covariance estimation. 

4.1.4 Summary 

Overall we can say that the performance of the methods depend on the model. For the models 
Sq 1 ^ and Sq 2 ^ the Gelato method performs best. In case of the models and , Gelato gets 

(5) 

outperformed by GLasso and the Space method and for the model O none of the three methods 
has a clear advantage. In Figures 1 to 4, we see the advantage of Gelato with thresholding over 
the one without thresholding, in particular, for the simulation settings Sq 1 ^, Sq 2 ^ and 0q 3 ^- Thus 
thresholding is a useful feature of Gelato. 

4.2 Application to real data 

4.2.1 ISOPRENOID GENE PATHWAY IN ARABIDOBSIS THALIANA 

In this example we compare the two estimators on the isoprenoid biosynthesis pathway data given 
in Wille et al. [2004]. Isoprenoids play various roles in plant and animal physiological processes 
and as intermediates in the biological synthesis of other important molecules. In plants they serve 
numerous biochemical functions in processes such as photosynthesis, regulation of growth and 
development. 

The data set consists of p = 39 isoprenoid genes for which we have n = 118 gene expression pat- 
terns under various experimental conditions. In order to compare the two techniques we compute 
the negative log-likelihood via 10-fold cross-validation for different values of A, r and 
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(c) n = 320 
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(e) n = 80 
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(i) n = 320 



Figure 5: Plots for model 0q ' . The triangles (green) stand for the GLasso and the circles (red) for 
our Gelato method with a reasonable value of r. The horizontal lines show the perfor- 
mances of the three techniques for cross-validated tuning parameters A, r, p and 77. The 
dashed line stands for our Gelato method, the dotted one for the GLasso and the dash- 
dotted line for the Space technique. The additional dashed line with the longer dashes 
stands for the Gelato without thresholding. Lambda/Rho stands for A or p, respectively. 
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4567 5678 
log(LO-norm) log(LO-norm) 

(a) isoprenoid data (b) breast cancer data 



Figure 6: Plots for the isoprenoid data from arabidopsis thaliana (a) and the human breast cancer 
data (b). 10-fold cross-validation of negative log-likelihood against the logarithm of 
the average number of non-zero entries of the estimated concentration matrix 6 n . The 
circles stand for the GLasso and the Gelato is displayed for various values of r. 

p. In Figure 6 we plot the cross-validated negative log-likelihood against the logarithm of the 
average number of non-zero entries (logarithm of the ^o- n o rm ) of the estimated concentration 
matrix 0„. The logarithm of the £rr norm reflects the sparsity of the matrix 0„ and therefore the 
figures show the performance of the estimators for different levels of sparsity. The plots do not 
allow for a clear conclusion. The GLasso performs slightly better when allowing for a rather dense 
fit. On the other hand, when requiring a sparse fit, the Gelato performs better. 

4.2.2 Clinical status of human breast cancer 

As a second example, we compare the two methods on the breast cancer dataset from West et al. 
[2001]. The tumor samples were selected from the Duke Breast Cancer SPORE tissue bank. The 
data consists of p = 7129 genes with n = 49 breast tumor samples. For the analysis we use the 
100 variables with the largest sample variance. As before, we compute the negative log-likelihood 
via 10-fold cross-validation. Figure 6 shows the results. In this real data example the interpretation 
of the plots is similar as for the arabidopsis dataset. For dense fits, GLasso is better while Gelato 
has an advantage when requiring a sparse fit. 

5. Conclusions 

We propose and analyze the Gelato estimator. Its advantage is that it automatically yields a positive 
definite covariance matrix £„, it enjoys fast convergence rate with respect to the operator and 
Frobenius norm of E n — Sq and Q n — Qq. For estimation of Qq, Gelato has in some settings a 
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better rate of convergence than the GLasso or SCAD type estimators. From a theoretical point of 
view, our method is clearly aimed for bounding the operator and Frobenius norm of the inverse 
covariance matrix. We also derive bounds on the convergence rate for the estimated covariance 
matrix and on the Kullback Leibler divergence. From a non-asymptotic point of view, our method 
has a clear advantage when the sample size is small relative to the sparsity S = \Eq\: for a given 
sample size n, we bound the variance in our re-estimation stage by excluding edges of Eq with 
small weights from the selected edge set E n while ensuring that we do not introduce too much 
bias. Our Gelato method also addresses the bias problem inherent in the GLasso estimator since 
we no longer shrink the entries in the covariance matrix corresponding to the selected edge set E n 
in the maximum likelihood estimate, as shown in Section 3.3. 

Our experimental results show that Gelato performs better than GLasso or the Space method for 
AR-models while the situation is reversed for some random precision matrix models; in case of an 
exponential decay model for the precision matrix, all methods exhibit the same performance. For 
Gelato, we demonstrate that thresholding is a valuable feature. We also show experimentally how 
one can use cross-validation for choosing the tuning parameters in regression and thresholding. 
Deriving theoretical results on cross-validation is not within the scope of this paper. 
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Appendix A. Theoretical analysis and proofs 

In this section, we specify some preliminary definitions. First, note that when we discuss estimat- 
ing the parameters So and Qq = Sq , we always assume that 

Vmax(So) := l/Vmin(©o) < 1/c < OO and l/Vmax(@o) = <Anin(£o) > k > 0, (36) 

where we assume k, c < 1 so that c < 1 < 1/k. (37) 
It is clear that these conditions are exactly that of (A2) in Section 3 with 

M upp := 1/c and M iow := k, 

where it is clear that for £o,u = 1, i = L • • • ,P, we have the sum of p eigenvalues of So, 
Yli=i Vt(So) = tr(So) = P- Hence it will make sense to assume that (37) holds, since other- 
wise, (36) implies that ip m - m (T,o) = ¥? max (£o) = 1 which is unnecessarily restrictive. 

We now define parameters relating to the key notion of essential sparsity sq as explored in Candes and Tao 
[2007]; Zhou [2009, 2010b] for regression. Denote the number of non-zero non-diagonal entries 
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in each row of Go by s l . Let s = max != i r „ :P s l denote the highest node degree in G = (V, Eq). 
Consider nodewise regressions as in (2), where we are given vectors of parameters j = 
1,. . . ,p,j 7^ i} for i = l,...,p. With respect to the degree of node i for each i, we define 
s}j < s l < s as the smallest integer such that 



min((/3j) 2 , A 2 Var(^)) < s^Var^), where A = y / 2\ogp/ 
where s denotes s n as defined in (7). 



ii, 



(38) 



Definition 8 (Bounded degree parameters.) The size of the node degree s l for each node i is 
upper bounded by an integer s < p. For s' l as in (38), define 



sq := max s < sand Sq u := > s 
i=l,...,p . ' 

i=l,..., p 



(39) 



where Sq^ h is exactly the same as in (8), although we now drop subscript nfrom s l Q in order to 
simplify our notation. 

We now define the following parameters related to So- For an integer m < p, we define the 
smallest and largest m-sparse eigenvalues of Sq as follows: 



V Pmm(m) 



mm 

t^0;m— sparse 



Eft 



V V>max(«l) 



max 

ty^0;m— sparse 



Eft 



Definition 9 (Restricted eigenvalue condition RE(so, ko, So)). For some integer I < so < p 
and a positive number k , the following condition holds for all 



mm mm 



1 > 0, (40) 



K(s , &o, £ ) ->c{i,...,p}, llvjcl^^fcollvjlli ||^j||2 

I J|<S0 

where vj represents the subvector ofv£M p confined to a subset Jof{l,... ,p}. 

When so an d ko become smaller, this condition is easier to satisfy. When we only aim to estimate 
the graphical structure Eq itself, the global conditions (36) need not hold in general. Hence up till 
Section D, we only need to assume that Eo satisfies (40) for sq as in (38), and the sparse eigenvalue 
Pmin(s) > 0. In order of estimate the covariance matrix Eo, we do assume that (36) holds, which 
guarantees that the RE condition always holds on So, and p maJ c(m), Pmi n (m) are upper and lower 
bounded by some constants for all m < p. We continue to adopt parameters such as K, p ma , x (s), 
and Pmax(3so) for the purpose of defining constants that are reasonable tight under condition (36). 
In general, one can think of 

Pmax(max(3so, s)) <C 1/c < oo and K 2 (sq, ko, So) <C 1/k < oo, 

for c, k as in (36) when sq is small. 
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Roughly speaking, for two variables Xi,Xj as in (1) such that their corresponding entry in ©o = 
(#o,ij) satisfies: #o,ij < ^\/9q.u, where A = y2 log(p)/n, we can not guarantee that (i, j) G E n 
when we aim to keep x s edges for node i, i = 1, . . . ,p. For a given @ , as the sample size n 
increases, we are able to select edges with smaller coefficient #o,ij - In fact it holds that 

\&o,ij\ < A which is equivalent to < Xay v for all j > s + 1 + I i<s ; +1 , (41) 

where Lr.} is the indicator function, if we order the regression coefficients as follows: 

> \&\... > > > \/3;\, 

in view of (2), which is the same as if we order for row i of ©o, 

|#0,il| > I ^0,£,2 1 — > > |^0,i,i+l | — > l#0,i,p|- (42) 

This has been show in [Candes and Tao, 2007]; See also Zhou [2010b]. 



A.l Concentration bounds for the random design 

For the random design X generated by (16), let £o,u = 1 for all i. In preparation for showing the 
oracle results of Lasso in Theorem 33, we first state some concentration bounds on X. Define for 
some < < 1 

JF{&) :={X:Vj = l,...,p,l-9< < 1 + 6} , (43) 

where X±,. . . , X p are the column vectors of the n x p design matrix X. When all columns of X 
have an Euclidean norm close to y/n as in (43) , it makes sense to discuss the RE condition in the 
form of (44) as formulated in [Bickel et al., 2009]. For the integer 1 < so < P as defined in (38) 
and a positive number ko, RE(so, ko,X) requires that the following holds for all u^0, 

1 A ll^^ll? 

— = mm min — — > 0. (44) 

K(so,k ,X) Jc{i,...,p},\\v J c\\ 1 <k \\vj\\ 1 y/n\\vj\\ 2 

\J\<S 

The parameter ko > is understood to be the same quantity throughout our discussion. The 
following event 1Z provides an upper bound on K{so, ko, X) for a given ko > when So satisfies 
RE(so, ko, So) condition: 

K(0) := jx : RE(s ,k ,X) holds with < K(s ,k ,X) < g ( s O'^ S o) | (45) 

For some integer m < p, we define the smallest and largest m-sparse eigenvalues of X to be 

A m in("i) := min ll-^II^A 7 ' 1 IMI2) anc ^ (46) 

v^0;m— sparse 

A m axM := , max ll^lll)' ( 47 ) 

v^0;m— sparse 
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upon which we define the following event: 

M(9) := {X : (49) holds Vm < max(s, (k + l)s )} , for which (48) 
< (1 - 9) v / p min {m) < y/A min (m) < V / A max (m) < (1 + 9) v / p max (m). (49) 

Formally, we consider the set of random designs that satisfy all events as defined, for some < 
9 < 1. Theorem 10 shows concentration results that we need for the present work, which follows 
from Theorem 1.6 in Zhou [2010a] and Theorem 3.2 in Rudelson and Zhou [201 1], 

Theorem 10 Let < 9 < 1. Let p m i n (s) > 0, where s < p is the maximum node-degree 
in G. Suppose RE(sq, 4, So) holds for sq as in (39), where £o,ii = 1 fo r i — 1, • • • ,p. Let 
/(so) = min (4soPmax('So) log(5ep/so) 5 so logp). Letc,a,c' > be some absolute constants. 
Then, for a random design X as generated by (16), we have 

P (X) := P (K(9) n F(9) n M{8)) > 1 - 3 exp(-c0 2 n/a 4 ) (50) 

as long as the sample size satisfies 

n > max ^ max (36if (s , 4, £ )/(s ), logp) , g2 logf — )L (51) 

Remark 11 We note that the constraint s < p/2, which has appeared in Zhou [2010a, Theorem 
1.6] is unnecessary. Under a stronger RE condition on T,q, a tighter bound on the sample size 
n, which is independent of p max (so), is given in Rudelson and Zhou [2011] in order to guarantee 
1Z(8). We do not pursue this optimization here as we assume that p max (so) is a bounded constant 
throughout this paper. We emphasize that we only need the first term in (51) in order to obtain 
J- (9) and TZ(8); The second term is used to bound sparse eigenvalues of order s. 

A.2 Definitions of other various events 

Under (Al) as in Section 3, excluding event X c as bounded in Theorem 10 and events C a , Xq to 
be defined in this subsection, we can then proceed to treat X G X n C a as a deterministic design 
in regression and thresholding, for which TZ(8) D M{9) H J 7 (9) holds with C a , We then make use 
of event Xq in the MLE refitting stage for bounding the Frobenius norm. We now define two types 
of correlations events C a and Xq. 

Correlation bounds on Xj and V%. In this section, we first bound the maximum correlation 
between pairs of random vectors (Vi,Xj), for all i,j where i ^ j, each of which corresponds to a 
pair of variables (Vi,Xj) as defined in (2) and (3). Here we use Xj and Vi, for all i,j, to denote 
both random vectors and their corresponding variables. 

Let us define ay. := yJ\ai(Vj) > v > as a shorthand. Let Vj := Vj/ay^j = 1, ... ,p be a 
standard normal random variable. Let us now define for all j, k 7^ j, 

1 1 n 

Zjk = -{ vj, x k } = - y2 v'jAx k i, 
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where for alii = 1, . . . , n v'j i5 x^ )i , Vj, j are independent standard normal random variables. 
For some a > 6, let event 

C n ■= s max |Zjjfc| < y/l + a^/{2\ogp)/n where a>6 . (52) 
I i.fc J 

Bounds on pairwise correlations in columns of X. Let So := (cro,ij), where we denote do m ; — 
of. Denote by A = X T X/n — Sq. Consider for some constant C3 > 4-^/5/3, 



Xq := < max|A,-fc| < C^aiaj ylog max{p, < 1/2 > . (53) 
I i.fc J 

We first state Lemma 12, which is used for bounding a type of correlation events across all regres- 
sions; see proof of Theorem 15. It is also clear that event C a is equivalent to the event to be defined 
in (54). Lemma 12 also justifies the choice of A n in nodewise regressions (cf. Theorem 15). We 
then bound event Xq in Lemma 13. Both proofs appear in Section A.3. 

Lemma 12 Suppose that p < e n / 4C 2. Then with probability at least 1 — 1/p 2 , we have 
Vj ^ k, 



-{Vj,X k [ 
n 



< a Vj Vl + a^(2logp)/n (54) 



where ay 3 = Y / Var(V ? -) and a > 6. Hence P (C a ) > 1 — 1/p 2 . 

Lemma 13 For a random desi 
where C3 > 4-^/5/3, we have 



Lemma 13 For a random design X as in (l)with £ojj = 1, Vj G {1, . . . ,p}, and for p < e n l AC l, 



{Xq) > 1 - l/max{n,p} 2 



We note that the upper bounds on p in Lemma 12 and 13 clearly hold given (Al). For the rest 
of the paper, we prove Theorem 15 in Section B for nodewise regressions. We proceed to derive 
bounds on selecting an edge set E in Section C. We then derive various bounds on the maximum 
likelihood estimator given E in Theorem 19- 21 in Section D, where we also prove Theorem 1. 
Next, we prove Lemma 12 and 13 in Section A.3. 



A.3 Proof of Lemma 12 and 13 

We first state the following large inequality bound on products of correlated normal random vari- 
ables. 

Lemma 14 Zhou et al. [2008, Lemma 38] Given a set of identical independent random variables 
Yi, . . . , Y n ~ Y, where Y = x\x 2 , with xi,x 2 ~ ^(0, 1) and o\ 2 = p\ 2 with p\ 2 < 1 being their 
correlation coefficient. Let us now define Q = - X^iLi =: ~(-^i>^2 ) = ^ SILi x i.i x 2, «■ Let 
*12 = (1 + d 2 2 )/2. For < r < ^ 12 , 
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Proof of Lemma 12. It is clear that event (54) is the same as event C a . Clearly we have at 
most p(p — 1) unique entries Zjk, Vj ^ k. By the union bound and by taking r = C21 /^S£ 
in (55) with cr jfc = 0, Vj, fc, where ^2(1 + a) > C 2 > 2^10/3 for a > 6. 



1 - P (C a ) = P I max |Z ifc | > ^2(1 + 0)^^ 



< P (max|Z jt | > C 2 ^) < - P )e X p (-"a*") 



10 y ' ' p 2 

where we apply Lemma 14 with = 0, Vj, A; = 1, . . . ,p, j ^ k and use the fact that EZjk = 0. 

™/4C| miarantpps that r7„, /HF 



Note that p < e n '^ guarantees that C 2 yf ^ < 1/2. ■ 

In order to bound the probability of event Xq, we first state the following large inequality bound 
for the non-diagonal entries of S , which follows immediately from Lemma 14 by plugging in 

°f = a o,u = 1) Vi = 1, . . . ,p and using the fact that |cro,?fc| = \Pjk&jO~k\ < 1 , Vj ^ A;, where pjk 
is the correlation coefficient between variables Xj and X^. Let ^ jk = (1 + a 2 . - k ) /2. Then 

{3nr 2 f 3nr 2 1 

~ 10(1 + a 2 , k) j ^ ex P {""20") for ° ~ T ~ *J fc " (56) 

We now also state a large deviation bound for the x„ distribution [Johnstone, 2001]: 

^-l>rj < exp^-),for0<T<-. (57) 

Lemma 13 follows from (56) and (57) immediately. 

Proof of Lemma 13. Now it is clear that we have p(p — l)/2 unique non-diagonal entries 
0"OjfcjVj 7^ /c and p diagonal entries. By the union bound and by taking r = C3 y/lgg g^fa 7 ^ 
in (57) and (56) with crojfc < L we have 

'log max{p, n} 



'((A- ) c ) = P max|A ifc | >C7 ; 



jk V n 

/ 3C| log max{p, re} \ p 2 — p ( 3Cf log max{p, re} 
< p exp — H — exp ' 



16 J 2 r V 20 

3Cf logmaxfare} ^ i\-^f-+2 ^ 1 

< p exp — = (max{p,n}) 20 < 



20 J ' " (max{p, n}) 2 

for C3 > 4y5/3, where for the diagonal entries we use (57), and for the non-diagonal entries, we 
use (56). Finally, p < e n / 4C $ guarantees that c 3 J losma * {p & < 1/2. ■ 
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Appendix B. Bounds for nodewise regressions 

In Theorem 15 and Lemma 16 we let s be as in (38) and Tq denote locations of the s largest 
coefficients of ff in absolute values. For the vector h % to be denned in Theorem 15, we let T\ 
denote the s largest positions of h l in absolute values outside of Tq, Let := Tq U I\\ We 
suppress the superscript in Tq, T{, and T^ throughout this section for clarity. 

Theorem 15 (Oracle inequalities of the nodewise regressions) Let < < 1. Let p m - m (s) > 0, 
where s < p is the maximum node-degree in G. Suppose RE(so,4,Y i q) holds for sq < s as 
in (39), where ^o,a = 1 forall i. Suppose /9max (max(s, 3so)) < oo. The data is generated by 
X^ 1 ' , . . . , X( n ) i.i.d. ~ Ap(0, So), where the sample size n satisfies (51). 

Consider the nodewise regressions in (10), where for each i, we regress X onto the other variables 
{X; k ^ i} following (2), where Vi ~ X(0, Var(Vj)) is independent of Xj,\/j ^ i as in (3). 

Let /5? nit be an optimal solution to (10) for each i. Let X n = d$\ = a^Acr^ where do is chosen 
such that do > 2(1 + 9)\/l + a holds for some a > 6. Let h l = /3? nit — /3^ Q . Then simultaneously 
for all i, on C a n X, where X := 11(0) n JF(0) n M (9), we have 



/3init-/3'|| 2 < \Js d J2D 2 + 2D 2 l +2,where 



1 



1 1 ^Tbi 1 1 2 — DodoXy Sq and 
where Dq, D\ are defined in (109) and (110) respectively. 
Suppose we choose for some constant cq > 4\/2 and ao = 7, 



An: 



^ < Did Q \s l Q (58) 



^0 = Co(l + 6') 2 V / /Omax(s)Pmax(3so), 

where we assume that p max (max(s, 3so)) < oo is reasonably bounded. Then 

5if 2 (so,4,E ) 49K 2 ( So ,4,Sq) 

-Co < ; ^ and D\ < ; — n — . 

u - (1 - 6) 2 16(1 - 0) 2 

The choice of do will be justified in Section F, where we also the upper bound on Do, D\ as above. 

Proof Consider each regression function in (10) with X.w being the design matrix and Xj the 
response vector, where X\j denotes columns of X excluding X- It is clear that for X n = doX, we 
have for i = 1, . . . , p and a > 6, 

A n = (do/av.)a Vi X := d a Vi X > d Xa Vi > 2(1 + 6)\y/l + aa Vi = 2(1 + 0)A CT)O)P 

such that (108) holds given that ay i < 1, Vi, where it is understood that cr := ery r 

It is also clear that on C a n Af, event T a (~^X holds for each regression when we invoke Theorem 33, 
with Y := X and X := Xu, for i = 1, . . . ,p. By definition c^cry = do. We can then invoke 
bounds for each individual regression as in Theorem 33 to conclude. ■ 
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Appendix C. Bounds on thresholding 

In this section, we first show Lemma 16, following conditions in Theorem 15. 

Lemma 16 Suppose RE(sq, 4, So) holds for so be as in (39) and p m \ n (s) > 0, where s < pis the 
maximum node-degree in G. Suppose p max (max(s, 3so)) < oo. Let S l = {j : j ^ i, /3j / 0}. 
Let Co > 4\/2 Z?e some absolute constant. Suppose n satisfies (51). Let /3? nit fte an optimal solution 
to (10) with 

\ n = d X where d = c (l + 9) 2 \/Pmax(s)Pmax(3so); 
Suppose for each regression, we apply the same thresholding rule to obtain a subset P as follows, 

I' !./ : J ■■ '•• I^Mtl >io = /oA}, anrf D* := {1, . . . , i — 1, i + 1, . . . ,p} \ P 

where fo := D^d^for some constant D 4 to be specified. Then we have on event C a n X, 

\P\ < + Dx/D^and \P VJ S l \ < s l + (D l / D^s^ and (59) 



\\Ph\\ 2 < d Q \^ S yi + {D Q + D A f (60) 

where T> is understood to be T> 1 and Dq,D\ are the same constants as in Theorem 15. 

We now show Corollary 17, which proves Proposition 4 and the first statement of Theorem 1. 
Recall ©o = • Let 0o,x> denote the submatrix of @o indexed by V as in (24) with all other 
positions set to be 0. Let Eq be the true edge set. 

Corollary 17 Suppose all conditions in Lemma 16 hold. Then on event C a n X,for ©o as in (26) 
and E as in (25), we have for 5o, n as in (39) and ©o = (#o,y) 

\E\ < 2(1 + Di/D A )S , n where \E\E \<2D 1 /D 4 So tn (61) 



1 00,2? Hi? : = 



©o-0o 



< /min{5o,„( max 0g ft ),s o ||diag(0 o )||^} ^(1 + (A) + D^)d X (62) 

i=l,...p ' 



S , n (1 + (.Do + D 4 ) 2 )C diag d X 

where Cj iag := min{maxj = i ) ... p Oq ^, (so/<Sb,n) ||diag(@o)||^}, and Dq, D\ are understood to be 
the same constants as in Theorem 15. Clearly, for D 4 > D±, we have (19). 

Proof By the OR rule in (9), we could select at most 2|/j| edges. We have by (59) 

\E\ < 2(l + £> 1 /D 4 )4 = 2(l + r> 1 /D 4 )5o, n , 

i=l,...p 

where (2D 1 /D i )S , n is an upper bound on \ E \ Eq\ by (63). Thus 

i=l i=l 

< min{S ,„(max 6 2 u ), s \\d[ a g(Q )\\ 2 F }(l + (D + D 4 ) 2 )d 2 \ 2 
i=i, ...p 1 
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Remark 18 Note that if sq is small, then the second term in Cdi ag will provide a tighter bound. 



Proof of Lemma 16. Let To := T ' denote the s largest coefficients of fS 1 in absolute values. 
We have 

1 



\r n r c | < 



init.T; 



i/oA 



< Didos l /(D 4 do) < D lS l /D A 



(63) 



by (58), where D\ is understood to be the same constant that appears in (58). Thus we have 

\f\ = |rnr c | + |/ i nr | < si{i + D x /D±). 

Now the second inequality in (59) clearly holds given (63) and the following: 

\f u 5*| < \S l \ + \l* n (5 i ) c | < s i + \f n (Tq) c |. 

II ' 1 1 2 

We now bound ||/3^,|| 2 following essentially the arguments as in Zhou [2009]. We have 



1^ 



v\\ 2 



where for the second term, we have 



^T nv\\ 2 + 

2 



< 



Ti 



2 
2 



< SqA ay. by definition of s l as in (38) 



and (41); For the first term, we have by the triangle inequality and (58), 

||/3T n©|| 2 - IK/ 3 * ~ Anit)T n©|| 2 + ||(Anit)Tonc|| 2 

< ||(/3 l -/3Lt)T || 2 + W|T nP| < \\h To \\ 2 + t J4 



< D d Q \Js l + D 4 d \Js l Q < (D + D 4 )d \Js l . 



Appendix D. Bounds on MLE refitting 



Recall the maximum likelihood estimate G n minimizes over all G S S n the empirical risk: 
@ n (E) = arg min i? n (©) := arg min |tr(0r n ) — log | @ | } 



0G-S 



(64) 



which gives the "best" refitted sparse estimator given a sparse subset of edges E that we obtain 
from the nodewise regressions and thresholding. We note that the estimator (64) remains to be 
a convex optimization problem, as the constraint set is the intersection the positive definite cone 
5++ and the linear subspace S P E . Implicitly, by using T n rather than S n in (64), we force the 
diagonal entries in (0 n (E'))~ 1 to be identically 1. It is not hard to see that the estimator (64) is 
equivalent to (14), after we replace S n with r ?t . 
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Theorem 19 Consider data generating random variables as in expression (16) and assume that 
(Al), (36), and (37) hold. Suppose £o,ii = lfor all i. Let £ he some event such that P (£) > 
1 — d/p 2 for a small constant d. Let Sq u be as defined in (39); Suppose on event £: 

1. We obtain an edge set E such that its size \E\ = lin(Sv^ n ) is a linear function in Sa n . 

2. And for Go as in (26) and for some constant Cbi as to be specified, we have 

1 1 ©0,25 1 1 p := ©0 — ©0 



< C biasA /25 ,„log(p)/n < c/32. 



(65) 



Let Q n (E) be as defined in (64). Suppose the sample size satisfies for C3 > 4y5/3, 

2 



106 
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n > [ 4C3 + ^2 J max { 2 \ E \ logmax(n,p), C£ ias 2S 0>n logp} . 
Then on event £ n Xq, we have for M = (9/{2k 2 )) ■ (4C7 3 + 32/(31c 2 )) 

@ n (E)-e \\ F < (M + 1) max j y/2\E\ log max(n,p)/n, C7 bias ^25 ,„ log(p)/n| 



(66) 



• (67) 



We note that although Theorem 19 is meant for proving Theorem 1, we state it as an independent 
result; For example, one can indeed take E from Corollary 17, where we have \E\ < cSo^ n for 
some constant c for D4 x flj, In view of (62), we aim to recover 60 by <d n (E) as defined in (64). 
In Section D.2, we will focus in Theorem 19 on bounding for W suitably chosen, 



e n (E) -G p = Op ( ^ x /5 ,„logmax(n,p)/n ) . 



By the triangle inequality, we conclude that 



< 



@n(E) - G 



+ 



©o-©o 



Op[WJS ,nlog(n)/n 



We now state bounds for the convergence rate on Frobenius norm of the covariance matrix and for 
KL divergence. We note that constants have not been optimized. Proofs of Theorem 20 and 2 1 
appear in Section D.3 and D.4 respectively. 

Theorem 20 Suppose all conditions, events, and bounds on \E\ and ||Go/d||j? in Theorem 19 
hold. Let Q n (E) be as defined in (64). Suppose the sample size satisfies for C3 > 4-^/5/3 and 
Cbias; M as defined in Theorem 19 



106 f 32 \ 

n > ~^TjJ ( ACi + 31^2 ) maX { 2 \ E \ l0g max (^' n )' C bhs 2S 0,n logp} • 

Then on event £ n Af , we /zave ^ m ; n (G n (£')) > c/2 > and for T< n (E) = (@ n (E))~ 1 , 



(68) 



S„(£?) 



< 



2(M + 1) 



max 



2\E\ logmax(n,p) 



1? 



Cbi: 



25 , n log(p) 



(69) 
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Theorem 21 Suppose all conditions, events, and bounds on \E\ and ||Oo,x>||^ := ©o ~~ @c 

in Theorem 19 hold. Let Q n (E) be as defined in (64). Suppose the sample size satisfies (66) 
for C3 > 4-^/5/3 and Cbi as , M as defined in Theorem 19. Then on event £ n X$, we have for 
R{@ n (E)) - R(Q ) > 0, 



R(@ n (E))-R{Q ) < M(C 3 + 1/8) max {2\E\ log max(n,p)/n, C b 2 ias 2S ,n log(p)/n} . (70) 



D.l Proof of Theorem 1 



Clearly the sample requirement as in (51) is satisfied for some > that is appropriately chosen, 
given (66). In view of Corollary 17, we have on £ := X n C a : for Cdi ag as in (18) 

Di 



\E\ < 2(1 + - i )5 ,n < 45 , n for D 4 > D x and 
JJ4 



e 



0,T>\\ F 



•-"bias 



e -e 



i=l,.. .p "'**' So.n 



min< max u Qii , 
=i,...p 

C 2 dg(l + (A, + Z> 4 ) 2 ) 



< Cbias\/25o, n log(p)/n < c/32 where 



diag(e )||!K(i + (A) + £>4) 



(71) 



Clearly the last inequality in (65) hold so long as n > 32 2 (7 bias 2So,n log(p)/c 2 , which holds 
given (66). Plugging in \E\ in (67), we have on £ n Xq, 



,4(1 + J Di/L» 4 )5o,n)logmax(n,p) 2S , n \ogp 
Q n (E)-Q ^<(M + l)max<Y ^ , Cbias y — 



Now if we take D 4 > D\, then we have (19) on event £; and moreover on £ D Xq, 



Q n (E)-Q < (M + l)max<J ^/85 ,„logmax(n,p)/n, C hias \/2S 0yn log(p)/n 



< WJ 5 , , n logmax(n,p)/n 

where W < \pl{M + 1) max{C d i a g<io \A + (A) + A) 2 , 2}. Similarly, we get the bound on 
E n — So with Theorem 20, and the bound on risk following Theorem 21. Thus all statements 

F 

in Theorem 1 hold. ■ 

Remark 22 Suppose event £ n Xq holds. Now suppose that we take D4 = 1, that is, if we take 
the threshold to be exactly the penalty parameter \ n : 

to = d(,X '■= A n . 

Then we have on event £ by (61) \E\ < 2(1 + Di)So >n and \E\Eq\ < 2D\So )n and on event on 
£ n X ,for C bias := Cdiagdov 71 + Wo + I) 2 

&n(E) - e 



4(1 + Z?i)S , o, ri logmax(n,p) . /25 ,„logp 
P < M max <| y , C bias y -— _ 
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It is not hard to see that we achieve essential the same rate as stated in Theorem 1, with perhaps 
slightly more edges included in E. 

D.2 Proof of Theorem 19 

Suppose event £ holds throughout this proof. We first obtain the bound on spectrum of ©o: It is 
clear that by (36) and (65), we have on £ , 

¥>min(@o) > V?min(©o) - ©0 ~ ©0 > ¥>min(©o) - || ©0,© || F > 31c/32, (72) 

~ ~ c 1 

<Anax(@o) < <Anax(©o) + ©0 ~ ©0 ^ < ¥>max(©o) + || @0,Z> \\ F < ^ + T • (73) 

Throughout this proof, we let So = (o~o,ij) '■= @o 1 - I n y i ew °f (72), define S := (@o) _1 - We 
use Q n := @ n (E) as a shorthand. 

Given Go G 5+ + n S P E as guaranteed in (72), let us define a new convex set: 

U n (@ ) := (S p ++ n S|) - 9 = {B - O \B G <S^ + n <S£} C <S£ 

which is a translation of the original convex set S+ + n <Sj^. Let be a matrix with all entries 
being zero. Thus it is clear that U n (@o) B given that G G S+ + n <S|,. Define for _R n as m 
expression (64) 

Q(e) := ^ n (e)-^(e )=tr(er n )-iog|e|-tr(e f n )+iog|eo| 

= tr ((6 - 9o)(f n - So)) - (log |6| - log |6 |) + tr ((© - O )S O ) . 

For an appropriately chosen r„ and a large enough M > 0, let 

T n = {AeU n (@ ),\\A\\ F = Mr n }, and (74) 
n n = {A G U n (@o), \\A\\ F < Mr n }. (75) 

It is clear that both IT n and T n U H n are convex. It is also clear that G II n . Throughout this 
section, we let 

J /2|£| logmax(n,p) /25 , ,„ log pi 
r n = max<y , GbiasW >. (76) 

Define for A G U n (Q ), 

G(A) := Q(6 + A) = tr(A(f n - £„)) - (log |© + A| - log |0 O |) + tr(AS ) (77) 

It is clear that G(A) is a convex function on U n (@o) and G(0) = Q(@o) = 0. 

Now, n minimizes Q(0), or equivalently A = n — ©o minimizes G(A). Hence by definition, 

G(A) < G(0) = 
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Note that T n is non-empty, while clearly G Ii n . Indeed, consider B e := (1 + e)6o, where 



e > 0; it is clear that B e - O G 5? , n 5?, and 



6r 



Mr n for lei 



Mr n / 



Go 



Note also if A G T n , then Ay = QV(i,j : i / j) £ E; Thus we have A G «S| and 



|A| 



|diag(A)|| + ||offd(A)|| <p + 2|£| where |£7| =Un(5 ,„). 



(78) 



We now show the following two propositions. Proposition 23 follows from standard results. 



Proposition 23 Let B be a p x p matrix. If B y a«<i S + D y 0, B + dZ) >- for all 

v G [0, 1]. 

Proposition 24 Under (36), we have for all A G T n swc/z f/iaf ||A||^ = Mr n for r n as in (76), 
©0 + vA y 0, \/v G a« opera interval I D [0, 1] o?i even? £. 

Proof In view of Proposition 23, it is sufficient to show that ©o + (1 + e)A, ©o — eA y for 
some e > 0. Indeed, by definition of A G T n , we have ip mm (&o + A) y on event £; thus 

Vmin 

(0 o + (l + e)A) > 

Vmin (9 + A)-e||A|| 2 > 
and ^ min (0 o - eA) > ^ min (0 o ) - e || A|| 2 > 31c/32 - e || A|| 2 > 

for e > that is sufficiently small. ■ 

Thus we have that log |©o + vA\ is infinitely differentiable on the open interval I D [0, 1] of v. 
This allows us to use the Taylor's formula with integral remainder to obtain the following: 

Lemma 25 On event £ n X , G(A) > Ofor all A G T n . 

Proof Let us use A as a shorthand for 



'1 - v)(Q + vA)' 1 ® (@o + vA^dv ) vecA 



vecA J 



where ® is the Kronecker product (if W = (t%) mX n, P = (bki)pxq, then W®P = (wijP) mpxnq ), 

2 

and vecA G MP is A pxp vectorized. Now, the Taylor expansion gives for all A G T n , 



log|6 + A| - log |O | 



d 



d 1 



— \og\®Q + vA\\ v=Q A+ / (1 -v)-^\og\Q + vA\dv 



dv 

= tr(S A) - A. 

Hence for all A G T n , 

G(A) = 2 + tr (A(P n - S )) = A + tr ( A(f „ - S )) - tr (a(S - E ) 

Sq)) as follows: by (65) and (72), we have on event £ 



(79) 



where we first bound tr(A(So 
tr(A(S - S )) 



(A, (S -S ) 



< IIAI 



S r 



< 



< 



|A| 



|A| 



le 



0,T>\\ F 



Vmin (©o) (e ) 



32C b ias y/2S 0i n logp/ 



< IIA| 



32r n 



31c 



2 ' 



(80) 
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Now, conditioned on event Xq, by (89) and (66) 



max|f n j fc - a jk\ < 4C 3 ^log max(n, p)/n =: 5 n 



and thus on event £ n Xq, we have 

VI|offd(A)|| ||offd(A)|| F < ^2\E\ 



tr(A(P n - E )) < <^n|ofFd(A)| 1 , where |offd(A)| 1 < 
A|| F , and 



tr(A(f n -E )) > -4C 3 Vlogmax(n,p)/nv/2il| ||A|| F > -4C 3 r„ || A|| F . (81) 

Finally, we bound A. First we note that for A G T n , we have on event £, 

7 
16k' 



A|| 2 < ||A|| F = Mr n < — , (82) 



given (66): n > (f ■ |) 2 (4C7 3 + max {(2\E\) log(n), C7 b 2 ias 25 ,„, logp}. Now we have 

by (73) and (37) following Rothman et al. [2008] (see Page 502, proof of Theorem 1 therein): on 
event £ , 

,2 < < 



A > \\A\\ F / 2 (e ) + ||A|| 2 

^ »^/| 2 (i + I + li ,2 ) >l|A|1 ^ <83) 

Now on event £ n <*b, for all A G T n , we have by (79),(83), (81), and (80), 

G(A) > ||A|| F ^-4C 3 r n ||A|| F -||A|| F |^ 

I, A „2 f2k 2 1 / 32r n 

= A p —-7. 4C 3 r n H o 

11 " F V 9 ||A|| F V 31c 2 



Alii ( ^ - — ( 4C7 3 + 



2k 2 1 /„ 32 



F V 9 M V 31c 

hence we have G(A) > for M large enough, in particular M = {9/(2k 2 )) (4C 3 + 32/(31c 2 )) 
suffices. ■ 

We next state Proposition 26, which follows exactly that of Claim 12 of Zhou et al. [2008]. 
Proposition 26 Suppose event £ holds. IfG(A) > 0, VA G T n , then G(A) > Ofor all A in 

W n = {A : A G 17„(0 O ), || A|| F > Mr n } 

/or r n as in (76); tfercce ifG(A) > Ofor all A G T n , then G(A) > Ofor all A G T n U W n . 

Note that for n G S+ + n <S F , we have A = G n — Go G U n (@o). By Proposition 26 and the fact 
that G(A) < C7(0) = on event £, we have the following: on event £, if G(A) > 0, VA G T n 
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then || A|| F < Mr n , given that A G U n (Q ) \ (T n U W re ). Therefore 

P (j|A|| F > Mr n ) < P(£ c ) +¥(£) -P (\\A\\ F > Mr n \£^j 

= P {£ c ) + P (£) ■ (1 - P (j|A|| F < Mr n |£) ) 

< P(£ C ) + P(£)-(1-P(G(A) >0,VAeT n |f)) 

< ¥(£ c ) +¥(£) ■ (1 -F(X \£)) 

= ¥ (£ c ) + P (X£ n £) < P (£ c ) + P (Af c ) 

< C | 1 < C+ 1 
~~ p 2 max(n,p) 2 ~~ p 2 

We thus establish that the theorem holds. ■ 



D.3 Frobenius norm for the covariance matrix 



We use the bound on 



@n(E) - G c 



as developed in Theorem 19; in addition, we strengthen 



the bound on Mr n in (82) in (85). Before we proceed, we note the following bound on bias of 

(Oor 1 . 



Remark 27 Clearly we have on event £, by (80) 



(eo)- 1 



< 



e 



q,t>\\f 



Vmin (©o)</>min(©o) 



< 



32Cbi as ^2S ,n logp/ 



31c 2 



(84) 



Proof of Theorem 20. Suppose event £ fl Xq holds. Now suppose 
16 9 f 32 \ 

U>{y Tc 2k?^ \ 3 + 31C 2 j maX { 2 l^l lo S maX ( n 'P)' Cbias^O.nlogp} 

which clearly holds given (68). Then in addition to the bound in (82), on event £ n Xq, we have 

Mr n < 7c/16, (85) 
for r n as in (76). Then, by Theorem 19, for the same M as therein, on event £ n Ao, we have 

Q n {E) - @o p < (M + 1) max ly/2\E\ log max(n,p)/?i, C b ias^/2S ,„log(p)/n j 

given that sample bound in (66) is clearly satisfied. We now proceed to bound S n — So 
given (67). First note that by (85), we have on event £ n Xq for M > 7 



<Pmin{®n(E)) > </? m in(©o) ~ @n ~ ©0 

> c — (M + l)r n > c/2. 
Now clearly on event £ n Afo, (69) holds by (67) and 

e n (E) - G 



> ¥>min(G ) 



G.-Go 



Zn(E) 



< 



(©n(£)) Vrnin (Go) 



< 



G n (£) - Q c 
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D.4 Risk consistency 

We now derive the bound on risk consistency. Before proving Theorem 21, we first state two 
lemmas given the following decomposition of our loss in terms of the risk as defined in (17): 

< R(G n (E)) - R(e ) = (R(e n (E)) - R(e )) + (i?(G ) - R(G )) (86) 

where clearly R(Q n (E)) > R(Qq) by definition. It is clear that Go £ S n for S n as defined in (31), 
and thus R n (Q ) > R n (Q n (E)) by definition of @ n (E) = argmm 0e ,s n R n {®)- 

We now bound the two terms on the RHS of (86), where clearly R(@o) > R(Qq). 

Lemma 28 On event £, we have for Ct,i as , Oo> @o as in Theorem 19, 

< R(@ Q ) - fl(8 ) < (32/(31c)) 2 C7 b 2 ias 2S °': l0gP < (32/(31c)) 2 • r 2 /2 < Mr 2 /8 

In 

for r n as in (76), where the last inequality holds given that M > 9/2(46*3 + 32/(31c 2 )). 
Lemma 29 Under £ n Xq, we have for r n as in (76) and M, C3 as in Theorem 19 

R(& n (E)) - R(@ ) < MC 3 r 2 n . 

Proof of Theorem 21. We have on £ n Xq, for r n is as in (76) 
R(G n (E)) - R(O ) = (R(e n (E)) - R(Q )) + (R(O ) - R(@ )) < Mr 2 (C 3 + 1/8) 
as desired, using Lemma 28 and 29. ■ 

Proof of Lemma 28. For simplicity, we use Ao as a shorthand for the rest of our proof: 

a := Q ,v = e - e . 

We use B as a shorthand for 

vecA T (J^ (1 - v)(@ + v Ao)" 1 (6 + v Ao)" 1 ^ vecA , 

where ® is the Kronecker product. First, we have for 0o, @o >- 

R(@ )-R(Q ) = tr(eoSo)-log|Go|-tr(e So) + log|eo| 

= tr((6 - e )S ) - (log |§o| - log |0o|) ■= B > 

where B = holds when || Aq\\ f = 0, and in the last equation, we bound the difference between 
two log I • I terms using the Taylor's formula with integral remainder following that in proof of 
Theorem 19; Indeed, it is clear that on £, we have 

G + vA y for v £ (—1,2) D [0, 1] 
39 



Zhou, Rutimann, Xu, and Buhlmann 



given that ip mm (@o) > cand ||Ao|| 2 < ||Ao||^ < c/32 by (65). Thus log |©o + v Aq\ is infinitely 
differentiable on the open interval / D [0, 1] of v. Now, the Taylor expansion gives 



log|6 + A | — log I O 1 = — log|e + wA ||,;=oAo + 

av 



f 1 d 2 

J ( 1 - V )^2 1 °S\ e O+ vA o\ dv 



tr(S A ) 



B, 



We now obtain an upper bound on B > 0. Clearly, we have on event £ , Lemma 28 holds given 
that 



B < \\A \\f ■ ¥W (J o (1 - v)(@ + vAo)' 1 <8 (0 O + vAo^'dv 
where \\A \\ 2 F < C^ ias 25o,n \og(p)/n and 



< 



J\l - v)(@ + vAo)" 1 ® (0o + ^Aq)" 1 ^ 

/ (1 - v)ip 2 max (Q + vAo^dv < sup ifi^Qo + vAo)- 1 [ (1 - v)dv 
Jo ve\o,i] Jo 



1 



1 



sup 



< 



2 t-efoTi] Vmin( o + ^A ) 2inf ve[ o,i] ^ in (0 O + vA ) 

1 < 1 

2(^min(e )- ||A || 2 ) 2 " 2 (31c/32) 2 



lAr 



where clearly for all u G [0, 1], wehave ^ in (0 o + «A ) > (v? m in(©o) _ 11^0112 
given ^ min (6o) > cand ||A || 2 < ||0o,:d|| f < c/32 by (65). ■ 



) 2 > (31c/32) 2 , 



Proof of Lemma 29. Suppose R(Q n (E)) — R(Qq) < 0, then we are done. 

Otherwise, assume R(Q n (E)) — R(@o) > throughout the rest of the proof. Define 

A := Q n (E) - O , 
which by Theorem 19, we have on event £ n Xq, and for M as defined therein, 



A 



< Mr n . 



@n(E) - ©o 

We have by definition R n (@ n (E)) < i? n (@o), and hence 

o < R(e n {E)) - R(e ) = R(e n (E)) - R n (e n (E)) + R n (e n (E)) - R(e ) 

< R(@ n (E)) - Rn(@ n (E)) + i? n (0 o ) - R(O ) 
= tr(0„(E)(Eo-f n ))-tr(0 o (So-f n )) 
= tr((0 n (£) - ©o)(S - f n )) = tr(A(S - f n )) 
Now, conditioned on event £ n Xq, following the same arguments around (81), we have 

So) 



tr (A(S, 



< S„ 



where 



offd(A) 



offd(A) i < 5 ny /2\E\ offd(A) 
< Mr n C 3 y/2\E\ logmax(n,p)/n < MC 3 r 2 n 
< 2\E\ by definition, and r n is as defined in (76). ■ 



40 



High-dimensional Covariance Estimation 



Appendix E. Proof of Theorem 6 

We first bound F (X ) in Lemma 30, which follows exactly that of Lemma 13 as the covariance 
matrix f° r variables X\/o~\, . . . , X p /a p satisfy the condition that ^o : u = 1, Vi £ {1, . . . ,p}. 

Lemma 30 Forp < e n l AC l, where C3 > 4-^/5/3, we have for Xq as defined in (53) 



'(<*b) > 1 - l/max{n,p} 2 



On event Xq, the following holds for r = C 3 



log max{p,n} 



< 1/2, where we assume p < e n / 4C 3, 



i\\2 



cr 2 n 



V« / 3, 



-(Xi/(Ti,Xj/(Tj ) ~ po 



n 



< T 



< T. 



(87) 
(88) 



Let us first derive the large deviation bound for 

\/l — t < ||Xj|| 2 / (a iy /n) < a/1 + r and for all i 7^ j 



First note that on event Xt 







n,ij P0,ij 



Sua 



P0,ij 



\Pij — P0,ij\ 



±{Xi/(Ti,Xj/(Tj ) ~ Ihy.j 



+ 



P0,ij 



XihttviVnti-iWXjh/faVn)) (ll^ll 2 /(^))-(ll^-|| 2 /( CTj ^)) 



< 



< 



-(Xi/(Ti,Xj/<Tj ) - pQ4j 



+ 



POM 



P0,ij 



P0,ij 



1-T 



+ \P0,; 



1 - r 



2r 

< < At. 

1 — T 



(89) 



Proof of Theorem 6. For ©o as in (26), we define 

o = we w = w(di& g (e ))w + we ,E nEW 

= di ag (We W) + We ,£ ni?^ = diag(tto) + n 0tEo nE 
where W = diag(Ho) 1 ^ 2 - Then clearly O £ as ©o £ >-V We first bound ||©o,x)|| F as follows. 

13 



\®o,v\\ F < C bisis y2S 0in log(p)/n < 
kr 2 rr 2 

< "min 

" (48c 2 < in C7 3 + 13)< 



< min 



m 1 n 

2 



CCr min I <- - 
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Suppose event £ holds throughout this proof. We first obtain the bound on spectrum of Bo: It is 
clear that by (36) and (33), we have on 8, 



</>min(©o) > ^min(©o) ~ 
(G ) < ^max(So) + 



e -e 
e -e 



12c 

2 >^min(©o)-|l©0,©|| F >^J, (90) 

< ^max(e ) + \\@0,v\\f < io~2 + U' (91) 
2 J^'JO'max 



Throughout this proof, we let Sq = (o"o,ij ) := B • ^ n v * ew °^ (90), define S := (Bo) Then 







-i 



W^Bo) -1 ^ 1 = W^EoW -1 := ^ c 



(92) 



We use O n := ti n (E) as a shorthand. Thus we have for Qq = WQqW, 



<Anax(^o) < Vmax(W / )^ max (Bo)^max(W) < + g 

_ 1 i 

¥?min(^o) 



^max(^o) ^max(W- 1 S W- 1 ) ^maxl^" 1 ) Vmax(S ) 
^min(VF) 2 , . 2 /S \ 2 12 £ 



'/-'max 



Vmin (©o) > ai 



13 
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Given fio G 5+ + n S P E as guaranteed in (93), let us define a new convex set: 

u n (n ) -.= (s p ++ n s p E ) -n = {B- n \B g n s£} c s£ 

which is a translation of the original convex set n S E . Let be a matrix with all entries 
being zero. Thus it is clear that U n (Qo) 3 given that f&o G 5+ + n S P E . Define for R n as in 
expression (30), 

Q(fi) := ^(O)-^(fio) =tr(fir n )-log|fi|-tr(fi r„) + log|fi | 

= tr f(n - O )(fn - $0)) - (log |fi| - log |O |) + tr ( (O - fi )tfo) • 



(94) 
(95) 



For an appropriately chosen r n and a large enough M > 0, let 



n, 



{A G U n (n ), \\A\\ F = Mr n }, and 
{A G U n {h ), \\A\\ F < Mr n }. 



It is clear that both U n and T n U Ii n art convex. It is also clear that G II n . Define for A G 

u n (n ), 

G(A) := Q(n + A) = tr(A(f n - *„)) - (log |f>o + A| - log |n |) + tr(A¥ ) (96) 
It is clear that G(A) is a convex function on U n (Qo) and G(0) = Q(£Iq) = 0. 
Now, Q n minimizes Q(Cl), or equivalently A = Q, n — Qq minimizes G(A). Hence by definition, 

G(A) < G(0) = 
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Note that T n is non-empty, while clearly G Ii n . Indeed, consider B € := (1 + e)fio, where 



e > 0; it is clear that B e - fi G S++ n 5p and 



B F - fir 



fir 



Mr n for lei 



Mr n / 



fir 



. Note also if A G T n , then A {j = 0V(i, j : i ^ j) <£ E; Thus we have A € 5^ and 



||A|| = ||diag(A)|| + ||offd(A)|| <p + 2\E\ where |P| = Un(5 ,„). (97) 
We now show the following proposition. 

Proposition 31 Under (36), we have for all A 6 T„ such that ||A||^ = Mr n for r n as in (76), 
fio + y A y 0, Vf G an open interval I D [0, 1] on even? £. 

Proof In view of Proposition 23, it is sufficient to show that fio + (1 + e)A, fio — ^A >- for 
some e > 0. Indeed, by definition of A G T n , we have </? mm (fio + A) ^ on event £; thus 

(fi + (l+£)A) > 

Vmin 

(fi + A) -e||A|| 2 > 
and ^ min (fio - eA) > <^ m i n (fio) - £ II A|| 2 > 12cr min c/13 - e || A|| 2 > 

for e > that is sufficiently small. ■ 

Thus we have that log |firj + uA| is infinitely differentiable on the open interval I D [0, 1] of u. 
This allows us to use the Taylor's formula with integral remainder to obtain the following: 

Lemma 32 On event £ n X , (5(A) > Ofor all A G T n , 

Proof Let us use A as a shorthand for 

vecA T (^J (1 - «)(fi + vA)" 1 ® (fi + ^A) _1 (i^ vecA, 

where ® is the Kronecker product (if = (%-) roXri , P = (b ke ) pxq , then W^P = {wijP) mpxnq ), 

2 

and vecA G M p is A pxp vectorized. Now, the Taylor expansion gives for all A G T n , 



log | fi + A| - log | fi 1 



— log | fi + uA||„ =0 A + 
av 

tr($ A) - A. 



r 1 d 2 



log | fio + vA\dv 



Hence for all A G T n , 



G(A) = A + tr(A(r„-tfo)) = i + tr A(r n -$ ) -ti A($ -$ 



(98) 



where we first bound tr(A(\&o — ^o)) as follows: by (33) and (72), we have on event £ 



tr(A(* - *o)) 



< HA 



(A,(tf - 
13r„. 



< IIAI 



lF 12a 2 c 2 



(99) 
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where we bound 



< 



as follows: 

l ||Qo,p||j7 

^min ^min(@o)¥>min(®oJ 



< maxW- 



< C b i as y/2So tn log p/n 13r n 



12(7= c2/13 - 12< in c2 



Now, conditioned on event Xq, by (89) 



max \T n j k - p jk\ < 4C 3 ^log max(n,p)/n =: 8 n 



and thus on event £ n -Xq, we have 



tr(A(f n - tf )) < $n |offd(A)| 1 , where |offd(A)| 1 < 



VI|offd(A)|| ||offd(A)|| F < V2|£j||A|| F ,and 



tr A(r n -* ) > -4C 3 v / logmax(n,p)/n x /2M||A|| F > -4C 3 r n ||A|| F . (100) 



(101) 



Finally, we bound A. First we note that for A G T„, we have on event £, 

3{T Ty , „ „ 



lAIn < IA 



8k 



given (34): n > (§ • J;) 2 ^ (4C 3 + 



max (2|£'|) log max(n,p), C^ ias 25 ,„ logp}. 



Now we have by (91) and (37) following Rothman et al. [2008] (see Page 502, proof of Theorem 1 
therein): on event £, 



A > \\A\\%/ 2 Vmax («o) + ||A 



> l|A|| F / \2o 



4 

max 



i _e_ _3_ 

fc + 13 + 8k 



> IIAI 



2k 2 

9(7 4 



Now on event 8 n AT , for all A G T n , we have by (98),(102), (100), and (99), 

13r n 



G{A) > ||A|| F 



IAI 



2 2fc 2 

9cr 4 

max 



2fc 2 



4C 3 r n ||A|| F - ||A|| F 



1 



So'max l|A|| F 



4C 3 r n + 



12n 2 r 2 



13rv 



12rx 2 r 2 



|A|| F I 



/ 2fc 2 



V ^O'm; 



1 

M 



4C 3 + 



13 



12cj 2 ; „c 2 



(102) 



hence we have G(A) > for M large enough, in particular M = (9cr max /(2fc 2 )) (4C 3 + 13/(12o\ 
suffices. ■ 

The rest of the proof follows that of Theorem 19, see Proposition 26 and the bounds which follow. 
We thus establish that the theorem holds. ■ 



2 

min 
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Appendix F. Oracle inequalities for the Lasso 

In this section, we consider recovering ft G W in the following linear model: 

Y = X/3 + e, (103) 

where X follows (16) and e ~ N(0, a 2 I n ). Recall given A n , the Lasso estimator for /3 G W is 
defined as: 

P = argmin^||y - X/3\\ 2 + A n ||/3||i, (104) 

which corresponds to the regression function in (10) by letting Y := X\ and X := Xw where 
X.\i denotes columns of X without i. Define so as the smallest integer such that 

p 

^min(/3?,AV) < s AV, where A = y/2\ogp/n. (105) 

i=i 

For X G F(6) as defined in (43), define 

< (1 + 9)\ <7 ,a, P , where X G J"(6>), for < 9 < 1 J, (106) 

where Ao-^p = oVl + o\/ (21ogp)/n, where a > 0. We have (cf. Lemma 34) 

P (T a ) > 1 - (Vvrlogpp' 1 )- 1 ; (107) 

II V ■ 1 1 

In fact, for such a bound to hold, we only need 11 ^f < 1 + 6, Vj to hold in F{9). 

We now state Theorem 33, which may be of independent interests as the bounds on 1-2 and i\ loss 
for the Lasso estimator are stated with respect to the actual sparsity sq rather than s = \ supp (/3)| 
as in Bickel et al. [2009, Theorem 7.2]. The proof is omitted as on event T a n X, it follows exactly 
that of Zhou [2010b, Theorem 5.1] for a deterministic design matrix X which satisfies the RE 
condition, with some suitable adjustments on the constants. 

Theorem 33 (Oracle inequalities of the Lasso) Zhou [2010b] Let Y = Xf3 + e, for e being 
Ltd. N(0,a 2 ) and let X follow (16). Let sq be as in (105) and Tq denote locations of the sq 
largest coefficients of (5 in absolute values. Suppose that RE{sq,A, So) holds with K(sq,4, So) 
and p m in(s) > 0. Fix some 1 > 9 > 0. Let /3i n j t be an optimal solution to (104) with 

Xn = d \a > 2(1 + 9)X a ^ p (108) 

where a > 1 and do > 2(1 + 9)y/l + a. Let h = /3; n it — /?t - Define 

X :=lZ{9)r\F(9)C\M{9). 

Suppose that n satisfies (51). Then onT a C\ X, we have 

\\PMt-P\\ 2 < A nx /ioY 2D 2 + 2D 2 + 2 := Xa^d ^2D 2 + 2D 2 + 2, 
||/iT n c ||i < D 1 X n s := DidoXaso, 
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where Dq and D\ are defined in (109) and (110) respectively, and¥ (X C\T a ) > 1— 3exp(— c0 2 n/a 4 
(y/ir logpp") -1 . 

Let T\ denote the so largest positions of h in absolute values outside of To; Let Tq\ := Tq U T\. 
The proof of Theorem 33 yields the following bounds on X n T a - \\hr 01 1| 2 < A}^o^ cr v / 'So where 



n S D o /n/i i m ^o,4,Eo)vV m ax(s-so) , 3^ 2 ( 5 q,4,£ ) 
= -, 2^2(1 + 9) ^—^ + 



(109) 



max 

(s - s ) . 2(1 + 6') 4 / o max (3s )p 
~ (1-0)^~J2^) ' d (l - 6) 2 p min {2s ) 

and 

2 ' 



. 4(1 + 0) 2 pmax(s - S ) / (1 + (?) v/Pmax(s " S ) 3K(s A^o) , 

Dl _ ma x < ^ , I - + 2(1 _ 0) I 



We note that implicit in these constants, we have used the concentration bounds for A max (3so)> 
A m ax(s— so) an d A m j n (2so) as derived in Theorem 10, given that (49) holds for m < max(s, (ko+ 
l)so), where we take feo > 3. In general, these maximum sparse eigenvalues as defined above will 
increase with so and s; Taking this issue into consideration, we fix for c® > 4\/2, A n = d^Xa 
where 



do = Co(l + 0) 2 y/p max (s ~ So)Pmax(3s ) > 2(1 + 9)y/l + a, 

where the second inequality holds for a = 7 as desired, given p m ax(3so), /? m ax(s — so) > 1- 
Thus we have for p max {3s ) > yO m ax(2s ) > Pmin(2s ) 

3 2 



D/do < 



Cq(1 + 6>)(1 - 0)v / Pmax(3s O ) V / /Omin(2so) C§(1 - 0) 2 p min (2s O ) 



3v / Pmin(2so) , 2 



O0(l - ^) 2 \//Omax(3so) / O m in(2so) 4(1 - 0) 2 p mm (2s O ) 

< 2(3co + 2)^ 2 (s ,4,S ) < 7^ 2 (s ,4,Sq) 



c 2 (l - Of 

which holds given that /3 m ax(3so) > L an d 1 < , 1 < y/2K(s$, ko, So), and thus 

V(°min(2so) 

— t — ?r^r < 2 as shown in Lemma 35; Hence 



n . J n/ , (4 + 3^C ) V / Pmax(s " S )p ma x(3s )(l + ^) 2 ^ 2 (s , 4, E ) 

Do < maxjD/do, 

^ 7^ 2 ( S0 ,4,S ) ^ 5^ 2 (s ,4,S ) , 

- V2(i-ey K {i-ey and 

A < f^^V^o,4, S o)< 49 ^ ( - 4 ' So) 



4(1-0) 4 J y u ' ' u/ - 16(1 -#) 2 
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where for both D\, we have used the fact that 

2(1 + #) 2 /w(s - so) 2 



< 



d 2 Q cg(l + 0) 2 /w(3s o ) " cg(l + 0)Vmin(2s o ) 

/f 2 (s ,4,E ( 

Cg(l + fl)2 



< 4K 2 ( So ,4,Sq) < ^ 2 ( So ,4,Sq) 



Appendix G. Misc bounds 

Lemma 34 For fixed design X with max.,- ||-Xj - ||2 < (1 + 9)y/n, where < 9 < 1, we have for 
T a as defined in (106), where a > 0, P ("77) < (^tt logpp ) -1 . 



Proof Define random vaiiables: lj = ^ Xat=i e i^i,j- Note that maxi<j-<p = ||X e/n||oo. 
We have E(Y,) = and Var ((Yj)) = \\Xj || 2 a 2 /n 2 < (1 + 6)a 2 /n. Let a = 1 + 9. Obviously, 

C 2 cr 2 

Yj has its tail probability dominated by that of Z ~ iV(0, ): 



2ci<r / — ni 



^ *) < P (I Z I > *) < 7= ex P 



/2 



3 s/2mit \2c\(t 2 

We can now apply the union bound to obtain: 

( , , \ c x o ( -nt 2 \ 

P max \Yj \ > t ] < p—=- exp — ^— ^ 
Vi<i<P 1 J ~ y/nt y \2c 2 a 2 J 

nt 2 . txprm 



exp( - ( ^ + log 72^- logp 



By choosing t = cioVl + ay/2 log p/n, the right-hand side is bounded by (y/ir logpp a ) 1 for 
a > 0. ■ 



Lemma 35 (Zhou [2010a]) Suppose that RE(sq, ko, So) holds for ko > 0, then for m = (ko + 
l)so, 



VPmm(m) > — — ; and clearly 

y/2 + k$K(so,ko,Z ) 



1 



if S 0) ii = l,Vi, tfiew 1 > v / /°min(2s ) > ^ — — fark >l. 

y/2K (s , «o, S ) 
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